RU2781355C1

RU2781355C1 - Scalar-vector processor

Info

Publication number: RU2781355C1
Application number: RU2021132215A
Authority: RU
Inventors: Ярослав Ярославович Петричкович; Татьяна Владимировна Солохина; Денис Александрович Кузнецов; Андрей Александрович Беляев; Юрий Николаевич Александров; Дмитрий Александрович Деревянко; Иван Андреевич Беляев; Юлия Викторовна Миронова; Виталий Сергеевич Гаврилов
Filing date: 2021-11-03
Publication date: 2022-10-11

Abstract

FIELD: computing technology.

SUBSTANCE: the technical solution relates to the field of microprocessor computing. The technical result is achieved in that the scalar-vector processor contains a reduction unit connected to the scalar and vector channels of the processor and implementing their interaction in operations where the scalar channel produces and/or consumes a scalar required and/or produced by the vector channel of the processor; the reduction unit also executes operations on the vector as a whole, namely permutation (shuffle) operations, LUT transformations, and histogram calculations; the scalar and vector channels of the processor are additionally joined by a ring bus that allows data exchange over it simultaneously with the execution of computational operations in the scalar and vector channels of the processor and in the reduction unit.

EFFECT: increased operating speed and volume of processed data owing to parallel scalar and vector calculations.

44 cl, 7 dwg, 20 tbl

Description

The invention relates to the field of microprocessors, namely to scalar-vector processors, and can be used in designing the architecture of processor IP cores aimed at digital signal processing tasks, including artificial intelligence and neural network applications.

The steadily growing complexity of digital signal processing tasks and the growing volume of processed data continuously raise the requirements on the computing performance of the microprocessor systems that perform these tasks. The main method of improving computing performance is parallelization. Parallelization is possible because many signal processing tasks require, as a rule, a large number of identical computational procedures applied to large arrays of data. Arrays of same-type data are vectorized, and subsequent high-performance processing is carried out on vectors rather than on individual elements. Microprocessor architectures that perform such processing are known as SIMD architectures (SIMD - Single Instruction, Multiple Data).

The difficulty, however, is that no real application achieves 100% vectorization. According to Amdahl's law, the overall speedup obtained by vectorization on a vector processor with P processing elements, as a function of the fraction f of the code that can be vectorized, equals 1/(1-f+f/P). Thus, in real applications a scalar part of the computation is always present alongside vector processing.
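The formula above can be evaluated directly. The following is a minimal sketch; the function name `amdahl_speedup` and the sample values of f and P are illustrative assumptions and do not come from the patent.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Overall speedup 1/(1 - f + f/P) for a vectorizable fraction f
    of the code on P processing elements (Amdahl's law)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("P must be at least 1")
    return 1.0 / ((1.0 - f) + f / p)


if __name__ == "__main__":
    # Hypothetical sample: 16 processing elements, varying vectorizable share.
    for f in (0.5, 0.9, 0.99):
        print(f"f = {f}: speedup = {amdahl_speedup(f, 16):.2f}")
```

For example, with f = 0.9 and P = 16 the speedup is 1/(0.1 + 0.05625) = 6.4, well below the ideal factor of 16, which illustrates why the scalar part cannot be ignored.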

In practice, various approaches are used to organize such mixed computations. For example, neural networks are widely implemented on heterogeneous computing systems that include a central processing unit (CPU), which handles the top level of the task and the associated scalar calculations, and a graphics processing unit (GPU), whose function is massively parallel data processing. The disadvantage of this approach is that transferring data from one processor to the other involves significant delays, which reduces the performance actually achieved.

This led to the emergence of architectures in which a number of scalar and vector computing sections (cores) are combined within one microprocessor. One of the main problems of such architectures is organizing efficient interaction between the scalar and vector channels of the microprocessor. In many applications that process vector data, there is a need for operations in which the scalar channel produces and/or consumes a scalar required and/or produced by the vector channel of the microprocessor. The scalar channel can thus prepare scalars needed by the vector channel, or further process scalars produced by it, ensuring that the vector channel can continue streaming vectors without unnecessary stalls.

Reduction operations are an example of operations that involve both the scalar and the vector channel of the processor. They include, in particular, finding and outputting the minimum or maximum value of a vector, the sum or product of the vector elements, and so on.
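A reduction takes a whole vector as input and yields a single scalar. The sketch below models this in plain Python; it illustrates only the semantics of the operations named above, not the hardware implementation, and the helper name `vector_reduce` is a hypothetical choice.

```python
from functools import reduce
import operator


def vector_reduce(op, vec):
    """Fold a non-empty vector into one scalar using the binary operation op."""
    if not vec:
        raise ValueError("cannot reduce an empty vector")
    return reduce(op, vec)


v = [3, -1, 7, 2]                          # illustrative vector
v_min = vector_reduce(min, v)              # minimum element
v_max = vector_reduce(max, v)              # maximum element
v_sum = vector_reduce(operator.add, v)     # sum of elements
v_prod = vector_reduce(operator.mul, v)    # product of elements
```

In each case the vector side supplies the operand and the scalar side receives the result, which is the kind of scalar-vector interaction the text describes.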

Image processing tasks also make wide use of other operations performed on the vector as a whole: permutation (shuffle) operations, histogram calculation, and table transformations (LUT, Look-Up Table). The known ways of implementing such procedures have their advantages and disadvantages. A software implementation based on a standard instruction set does not achieve high performance, while a hardware implementation in the form of accelerators requires additional hardware and has limited flexibility. For these reasons, the search for efficient ways of performing such computational procedures remains relevant.
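The three whole-vector operations named above can be modelled in a few lines. This is a reference sketch of their semantics only, assuming non-negative integer elements small enough to index the table or the bins; the function names are hypothetical.

```python
def shuffle(vec, indices):
    """Permutation: output element i is vec[indices[i]]."""
    return [vec[i] for i in indices]


def lut_transform(vec, table):
    """Table transformation: replace each element x with table[x]."""
    return [table[x] for x in vec]


def histogram(vec, num_bins):
    """Count how many times each value 0..num_bins-1 occurs in vec."""
    counts = [0] * num_bins
    for x in vec:
        counts[x] += 1
    return counts


v = [2, 0, 3, 0]                                  # illustrative vector
reversed_v = shuffle(v, [3, 2, 1, 0])             # [0, 3, 0, 2]
mapped_v = lut_transform(v, [10, 11, 12, 13])     # [12, 10, 13, 10]
hist_v = histogram(v, 4)                          # [2, 0, 1, 1]
```

All three need access to arbitrary elements of the vector rather than to one lane at a time, which is why they do not map well onto independent per-lane SIMD processing.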

A known solution (patent US 5659706) describes a scalar-vector processor with separate scalar and vector channels. Each processor channel is divided into functional blocks. The disadvantage of this solution is that there is no close interaction between the functional blocks of the scalar and vector channels. The two channels operate completely independently, which causes additional time losses when data are transferred from one part of the processor to the other and, consequently, degrades processor performance.

Another known solution (patent US 5822606 A) describes one of the first signal processor architectures containing scalar and vector computing cores that operate simultaneously. The disadvantage of this architecture is that, as in the previous case, the scalar and vector computing cores do not interact with each other directly; data are exchanged between them through external memory, which involves significant delays and, consequently, degrades processor performance.

Patent US 2004015677 A1 describes an architecture of a scalar-vector digital signal processor in which certain reduction operations are performed serially, by passing data from one SIMD section to the next. The disadvantage of this architecture is the low performance achieved when the computation is organized sequentially.

A known architecture of a scalar-vector processor (patent US 2005/0240644 A1) includes a set of functional (computing) blocks containing vector and scalar channels that interact with each other. The disadvantage of this architecture is that the interaction between the vector and scalar channels of the processor is confined to a particular functional block. In addition, this solution does not support reduction operations.

Patent US 2020/142704 A1 describes an architecture of a scalar-vector processor that, along with SIMD parallelization, also uses instruction-level parallelism on the VLIW (Very Long Instruction Word) principle in both the scalar and the vector channel of the processor, which increases overall performance. However, it provides no mechanisms for interaction between the scalar and vector channels of the processor.

The closest to the claimed invention is the architectural approach described in patent US 2021/0216318 A1. That patent proposes a whole family of scalar-vector processor architecture variants, including ones that support interaction between the scalar and vector channels of the processor and provide for reduction operations. These scalar-vector processor architectures are taken as the prototypes of the claimed invention. However, the reduction operations in that patent are implemented on the vector computing sections by means of additional links between the sections. This worsens the scalability of the architecture and makes it impossible to perform reduction operations simultaneously with other vector computations. In addition, the proposed approach does not support other operations on the vector as a whole: permutations, histogram calculation, and table transformations (LUT).

The technical result of the invention is a scalar-vector processor with increased efficiency, speed, functionality, and versatility owing to the following: it contains a reduction unit connected to the scalar and vector channels of the processor and implementing their interaction in a variety of operations in which the scalar channel produces and/or consumes a scalar required and/or produced by the vector channel of the processor; the reduction unit additionally performs various operations on the vector as a whole, namely permutation (shuffle) operations, LUT transformations, and histogram calculations; and the scalar and vector channels of the processor are additionally joined by a ring bus that allows data to be exchanged over it simultaneously with computational operations in the scalar and vector channels of the processor and in the reduction unit.

The stated technical result is achieved by creating a scalar-vector processor 100 comprising scalar and vector data processing channels 105 and 107 that are joined by a ring bus CDB 112 and connected to a reduction unit VRED 104 and to a first-level data memory DMEM/L1D$ 103, which is connected to a second-level cache L2$ 101, which is connected to the external interface of the processor, which has access to the external memory of the computing system, the second-level cache L2$ 101 also being connected to a first-level program memory PMEM/L1I$ 102, the output of which is connected to the input of a command fetch unit FETCH 109, the output of which is connected to the input of a command decoding unit DECODE 110, the first output of which is connected to the inputs of the scalar and vector channels and the second output of which is connected to a program control unit PCTRL 111, the output of which is connected to the input of the first-level program memory PMEM/L1I$ 102, wherein

- the first-level program memory PMEM/L1I$ 102 and the first-level data memory DMEM/L1D$ 103 are configured to generate requests and pass them to

- the second-level cache L2$ 101, which is configured to service requests from the first-level program memory PMEM/L1I$ 102 and the first-level data memory DMEM/L1D$ 103, and to load data through the external interface from the external memory of the computing system and pass the data to the first-level data memory DMEM/L1D$ 103 and the first-level program memory PMEM/L1I$ 102;

- the command fetch unit FETCH 109 is configured to fetch commands from the program memory PMEM/L1I$ 102 and pass them to

- the command decoding unit DECODE 110, which is configured to decode commands, generate program control commands for the execution units of the processor, and pass them to

- the PCTRL unit, which is configured to execute program control commands.

In a preferred embodiment of the processor, the first-level data memory DMEM/L1D$ 103 is implemented as a first-level cache L1D$ or as a tightly coupled (TCM, Tightly-Coupled Memory) static memory DMEM.

In a preferred embodiment of the processor, the first-level program memory PMEM/L1I$ 102 is implemented as a first-level cache L1I$ or as a tightly coupled (TCM, Tightly-Coupled Memory) static memory PMEM.

In a preferred embodiment, the processor has a Harvard architecture at the level of the computing core, with simultaneous access to the first-level program memory PMEM/L1I$ 102 and the first-level data memory DMEM/L1D$ 103 over separate buses.

In a preferred embodiment of the processor, the second-level cache L2$ 101 has a von Neumann architecture.

In a preferred embodiment of the processor, the program control commands are selected from a command set containing program jump commands and program loop commands.

In a preferred embodiment of the processor, the commands are combined into instructions organized as a VLIW packet 201 (VLIW - Very Long Instruction Word).

In a preferred embodiment of the processor, the VLIW packet 201 contains up to eight commands, of which up to four are intended for the execution units of the scalar data processing channel and up to four for the execution units of the vector data processing channel.

In a preferred embodiment of the processor, the VLIW packet 201 contains up to two scalar commands and up to two vector commands for data exchange with the data memory DMEM/L1D$ 103.
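The packet limits stated above can be expressed as a simple validity check. This is a sketch under assumptions: the `Command` type and `packet_is_valid` function are hypothetical, and the text does not say whether memory-exchange commands count toward the per-channel limit of four, so the sketch assumes that they do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    channel: str             # "scalar" or "vector"
    is_memory: bool = False  # True for a data exchange with DMEM/L1D$


def packet_is_valid(commands) -> bool:
    """Check the stated VLIW packet limits.

    Assumption (not stated in the source): memory-exchange commands
    are counted within the four commands allowed per channel.
    """
    scalar = [c for c in commands if c.channel == "scalar"]
    vector = [c for c in commands if c.channel == "vector"]
    if len(scalar) + len(vector) != len(commands):
        return False  # unknown channel name
    return (
        len(commands) <= 8
        and len(scalar) <= 4
        and len(vector) <= 4
        and sum(c.is_memory for c in scalar) <= 2
        and sum(c.is_memory for c in vector) <= 2
    )


# A full packet: two compute and two memory commands per channel.
full_packet = (
    [Command("scalar")] * 2 + [Command("scalar", True)] * 2
    + [Command("vector")] * 2 + [Command("vector", True)] * 2
)
```

Under these assumptions `full_packet` is accepted, while a packet with five scalar commands or three vector memory commands is rejected.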

In a preferred embodiment, the processor has a command set consisting of program control commands, commands for the execution units of the scalar data processing channel and of the vector data processing channel, and commands for the reduction unit VRED 104.

In a preferred embodiment of the processor, the scalar channel 105 contains one scalar computing section 106.

In a preferred embodiment of the processor, the scalar computing section 106 contains a scalar register file RF 301, which is multi-ported and stores the scalar data being processed.

In a preferred embodiment of the processor, the scalar register file RF 301 contains ports associated with the scalar data processing channel 105 and configured to exchange data with the data memory DMEM/L1D$ 103.

In a preferred embodiment of the processor, the scalar register file RF 301 contains ports associated with the execution units of the scalar computing section 106 of the scalar data processing channel 105 and configured to supply source data for computational operations and to write the results of the operations back to the scalar register file RF 301.

In a preferred embodiment of the processor, the scalar computing section 106 contains data processing units SLSE0 310 and SLSE1 311, which are configured to provide data exchange between the data memory DMEM/L1D$ 103 and the scalar register file RF 301, including execution of data transfer commands between the data memory DMEM/L1D$ 103 and the scalar register file RF 301.

In a preferred embodiment of the processor, the scalar computing section 106 contains data processing units ALU0 302, ALU1 303, ALU2 304, and ALU3 305, which perform arithmetic and logical operations on fixed-point numbers.

In a preferred embodiment of the processor, the scalar computing section 106 contains data processing units FALU0 306 and FALU1 307, which perform arithmetic and logical operations on floating-point numbers.

In a preferred embodiment of the processor, the scalar computing section 106 contains data processing units SMU0 308 and SMU1 309, which perform multiplication operations on fixed-point and floating-point numbers.

In a preferred embodiment of the processor, the scalar computing section 106 contains a data processing unit SH 312, which performs logical and arithmetic shift operations.

In a preferred embodiment of the processor, the scalar computing section 106 contains a data processing unit CONV 315, which performs data type conversion operations.

In a preferred embodiment of the processor, the scalar computing section 106 contains a data processing unit DIV 313, which performs division operations.

In a preferred embodiment of the processor, the scalar computing section 106 contains a data processing unit MF 314, which computes transcendental mathematical functions. The basic set of operations performed by the unit MF 314 is given in Table 9.

In a preferred embodiment of the processor, the vector channel 107 consists of several vector computing sections 108, the number of which corresponds to the width of the vector being processed.

In a preferred embodiment of the processor, the vector computing section 108 contains a vector register file VRF 401, which is multi-ported and multi-format and stores the vector data being processed.

In a preferred embodiment of the processor, the vector register file VRF 401 is multi-format, so that each 64-bit register 500 of the vector register file VRF 401 can store either one 64-bit value 501, or two 32-bit values 502, or four 16-bit values 503, or eight 8-bit values 504.
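The multi-format register can be pictured as one 64-bit word viewed at different element widths. The sketch below is illustrative only: the function name `unpack_register` is hypothetical, and the lane order (least-significant element first) is an assumption, since the text does not specify it.

```python
def unpack_register(value: int, width: int) -> list[int]:
    """Split a 64-bit register value into unsigned elements of the given
    width (8, 16, 32, or 64 bits).

    Assumption (not stated in the source): element 0 occupies the
    least-significant bits of the register.
    """
    if width not in (8, 16, 32, 64):
        raise ValueError("width must be 8, 16, 32, or 64")
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(64 // width)]


reg = 0x0807060504030201             # illustrative register contents
as_bytes = unpack_register(reg, 8)   # eight 8-bit values
as_halves = unpack_register(reg, 16) # four 16-bit values
as_words = unpack_register(reg, 32)  # two 32-bit values
as_dword = unpack_register(reg, 64)  # one 64-bit value
```

The same 64 stored bits yield eight, four, two, or one element depending on the format selected by the command that reads the register.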

В предпочтительном варианте осуществления процессора векторный регистровый файл VRF 401 содержит порты, связанные с внешним интерфейсом векторного канала 107 и выполненные с возможностью обмена данными с памятью данных DMEM/L1D$ 103.In the preferred embodiment of the processor, the vector register file VRF 401 contains ports associated with the external interface of the vector channel 107 and configured to communicate with the data memory DMEM/L1D$ 103.

В предпочтительном варианте осуществления процессора векторный регистровый файл VRF 401 содержит порты, связанные с исполнительными устройствами векторной вычислительной секции 108, выполненные с возможностью передачи исходных данных для выполнения вычислительных операций и записи результатов обратно.In the preferred embodiment of the processor, the vector register file VRF 401 contains ports associated with the execution units of the vector computing section 108, configured to transmit initial data to perform computational operations and write the results back.

В предпочтительном варианте осуществления процессора векторный регистровый файл VRF 401 выполнен с возможностью работы с различными форматами данных.In the preferred embodiment of the processor, the vector register file VRF 401 is configured to work with various data formats.

In a preferred embodiment of the processor, the vector computing section 108 contains units VLSE0 412 and VLSE1 413, which are configured to exchange data between the data memory DMEM/L1D$ 103 and the vector register file VRF 401, including executing the data transfer commands between them.

In a preferred embodiment of the processor, the vector computing section 108 contains units VALU0 403, VALU1 404, VALU2 405, and VALU3 406, configured to perform arithmetic and logical operations on fixed-point numbers.

In a preferred embodiment of the processor, the vector computing section 108 contains units VFALU0 407 and VFALU1 408, configured to perform arithmetic and logical operations on floating-point numbers.

In a preferred embodiment of the processor, the vector computing section 108 contains units VMU0 409 and VMU1 410, configured to perform multiplication and multiply-accumulate operations on fixed-point and floating-point numbers.

In a preferred embodiment of the processor, the vector computing section 108 contains a vector register file of accumulator registers VAC 402, configured to store the data produced and consumed by the multiply-accumulate operations performed by the vector multiplier units VMU0 409 and VMU1 410.

In a preferred embodiment of the processor, the vector computing section 108 contains a unit VSH 411, configured to perform logical and arithmetic shift operations on vector operands.

In a preferred embodiment of the processor, the vector computing section 108 contains a unit VCONV 414, configured to perform data type conversion operations on vector operands.

In a preferred embodiment of the processor, the reduction unit VRED 104 is configured to compute reduction functions while increasing the efficiency, speed, functionality, and versatility of the processor.

In a preferred embodiment of the processor, the reduction unit VRED 104 is configured to compute reduction functions and to implement the interaction between the scalar and vector parts of the processor in the various operations in which the scalar channel 105 produces and/or consumes a scalar that is required and/or produced by the vector channel 107.

In a preferred embodiment of the processor, the reduction unit VRED 104 contains a unit RALU 601, configured to perform arithmetic-logic inter-section reduction operations.

In a preferred embodiment of the processor, the reduction unit VRED 104 contains a unit SHUFFLE 602, configured to perform inter-section permutation operations.

In a preferred embodiment of the processor, the reduction unit VRED 104 contains a unit LUT 603, configured to perform inter-section table look-up (LUT) transformations.

In a preferred embodiment of the processor, the reduction unit VRED 104 contains a unit HIST 604, configured to perform histogram calculation operations.
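The four kinds of whole-vector operation attributed to the reduction unit can be pictured with a small software model. The sketch below is illustrative only: the exact operations are defined in Tables 16-19, which are not reproduced here, so the function names, the choice of summation as the reduction, and the one-element-per-section layout are all assumptions.

```python
# Hypothetical software model of whole-vector operations of the kind
# attributed to VRED 104. One list element stands for one vector section.

def reduce_sum(lanes):
    """RALU-style reduction: the whole vector collapses into one scalar."""
    return sum(lanes)

def shuffle(lanes, perm):
    """SHUFFLE-style permutation: output position k takes lanes[perm[k]]."""
    return [lanes[p] for p in perm]

def lut(lanes, table):
    """LUT-style transformation: each element is replaced by table[element]."""
    return [table[x] for x in lanes]

def histogram(lanes, bins):
    """HIST-style operation: count how often each value 0..bins-1 occurs."""
    counts = [0] * bins
    for x in lanes:
        counts[x] += 1
    return counts

lanes = [3, 1, 3, 0]
print(reduce_sum(lanes))                # 7
print(shuffle(lanes, [3, 2, 1, 0]))     # [0, 3, 1, 3]
print(lut(lanes, [10, 11, 12, 13]))     # [13, 11, 13, 10]
print(histogram(lanes, 4))              # [1, 1, 0, 2]
```

In each case the result depends on more than one section at once, which is why such operations are grouped in a unit that sits between the sections rather than inside any one of them.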

In a preferred embodiment of the processor, the ring bus CDB (Circular Data Bus) 112 is configured to exchange data simultaneously with the computational operations performed in the scalar and vector channels 105, 107 and in the reduction unit VRED 104.

In a preferred embodiment of the processor, the ring bus CDB 112 is configured to execute cyclic shift commands, as a result of which: the register Ri of the scalar register file RF 301 is shifted into the register Vj of the vector register file VRF 401 of vector computing section 0: Vj.0 = Ri; the register Vj of the vector register file VRF 401 of the highest vector computing section (N-1) is shifted into the register Ri of the scalar register file RF 301: Ri = Vj.(N-1); and the registers Vj of the vector register files VRF 401 of the remaining vector computing sections 108 are shifted by one section toward the higher sections: Vj.k = Vj.(k-1), k = 1, 2, …, N-1.
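The three assignments of the cyclic shift can be modelled as a single rotation of a ring of N+1 registers. The following is a minimal sketch, assuming that all three assignments read the old register values and take effect simultaneously; the function name `cdb_shift` and the list representation of Vj are hypothetical.

```python
def cdb_shift(r_i, v_j):
    """One cyclic shift step over the ring; returns the new (Ri, Vj).

    v_j[k] stands for Vj.k, the register Vj of vector section k.
    """
    new_r = v_j[-1]               # Ri   = Vj.(N-1)
    new_v = [r_i] + v_j[:-1]      # Vj.0 = Ri;  Vj.k = Vj.(k-1), k = 1..N-1
    return new_r, new_v

r, v = cdb_shift(100, [0, 1, 2, 3])
print(r, v)   # 3 [100, 0, 1, 2]
```

Because the ring holds N+1 values, applying the shift N+1 times returns every value to its starting register.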

In a preferred embodiment of the processor, the ring bus CDB 112 is configured to sequentially move data from the vector channel 107 to the scalar channel 105 in order to perform operations that are available only in the scalar channel 105, and then to return the transformed data to the vector channel 107.
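One way to picture this use of the ring bus is as a loop of cyclic shifts interleaved with a scalar operation. The sketch below is an assumption-laden illustration, not the patented command sequence: it models the shift as a simultaneous rotation (Vj.0 = Ri, Ri = Vj.(N-1), Vj.k = Vj.(k-1)), and the names `cdb_shift` and `map_via_scalar` are hypothetical. Integer division is used as the example because a divider (DIV 313) is listed only among the scalar execution units.

```python
from typing import Callable, List, Tuple

def cdb_shift(r_i: int, v_j: List[int]) -> Tuple[int, List[int]]:
    """One cyclic shift: Vj.0 = Ri, Ri = Vj.(N-1), Vj.k = Vj.(k-1)."""
    return v_j[-1], [r_i] + v_j[:-1]

def map_via_scalar(v_j: List[int], scalar_op: Callable[[int], int]) -> List[int]:
    """Apply a scalar-only operation to every element of Vj via the ring."""
    r, v = cdb_shift(0, list(v_j))    # first shift brings Vj.(N-1) into Ri
    for _ in range(len(v_j)):
        r = scalar_op(r)              # operation done in the scalar channel
        r, v = cdb_shift(r, v)        # result re-enters at section 0
    return v                          # after N+1 shifts the order is restored

print(map_via_scalar([1, 2, 4, 5], lambda x: 100 // x))   # [100, 50, 25, 20]
```

After N+1 shifts every element has passed through the scalar register once and has returned to its original section, so no separate reordering step is needed.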

For a better understanding of the claimed invention, a detailed description with the accompanying drawings follows.

Fig. 1. Block diagram of the scalar-vector processor according to the invention.

Fig. 2. Structure of a VLIW instruction of the scalar-vector processor according to the invention.

Fig. 3. Block diagram of the scalar computing section of the scalar channel of the scalar-vector processor according to the invention.

Fig. 4. Block diagram of a vector computing section of the vector channel of the scalar-vector processor according to the invention.

Fig. 5. Structure of the multi-format vector register file within a computing section of the vector channel according to the invention.

Fig. 6. Block diagram of the reduction unit according to the invention.

Fig. 7. Ring bus for inter-section scalar-vector data exchange according to the invention.

Table 1. Basic command set of the program control unit according to the invention.

Table 2. Basic command set of the memory exchange unit of the scalar channel according to the invention.

Table 3. Basic command set of the fixed-point arithmetic-logic unit of the scalar channel according to the invention.

Table 4. Basic command set of the floating-point arithmetic-logic unit of the scalar channel according to the invention.

Table 5. Basic command set of the fixed- and floating-point multiplication unit of the scalar channel according to the invention.

Table 6. Basic command set of the shift unit of the scalar channel according to the invention.

Table 7. Basic command set of the type conversion unit of the scalar channel according to the invention.

Table 8. Basic command set of the division unit of the scalar channel according to the invention.

Table 9. Basic command set of the transcendental function unit of the scalar channel according to the invention.

Table 10. Basic command set of the memory exchange unit of the vector channel according to the invention.

Table 11. Basic command set of the fixed-point arithmetic-logic unit of the vector channel according to the invention.

Table 12. Basic command set of the floating-point arithmetic-logic unit of the vector channel according to the invention.

Table 13. Basic command set of the fixed- and floating-point multiplication unit of the vector channel.

Table 14. Basic command set of the shift unit of the vector channel according to the invention.

Table 15. Basic command set of the type conversion unit of the vector channel according to the invention.

Table 16. Basic command set of the arithmetic-logic unit for inter-section reduction.

Table 17. Basic command set of the inter-section permutation unit according to the invention.

Table 18. Basic command set of the inter-section table look-up (LUT) transformation unit according to the invention.

Table 19. Basic command set of the histogram calculation unit according to the invention.

Table 20. Basic command set of the inter-section scalar-vector shift according to the invention.

The architecture of the claimed scalar-vector processor is oriented primarily toward digital signal processing tasks that involve massively parallel computation, including artificial intelligence and neural network applications.

The processor 100 (Fig. 1) includes a scalar data processing channel 105 (Scalar Channel) and a vector data processing channel 107 (Vector Channel). The scalar channel 105 includes one scalar computing section 106 (Scalar Unit), while the vector channel 107 includes several vector computing sections 108 (Vector Lanes).

The processor interacts with memory in the conventional way. At the level of the processor core, a Harvard architecture is implemented, allowing simultaneous access to program memory and data memory over separate buses. Program and data memory can be implemented either as first-level caches (L1I$ 102 and L1D$ 103, respectively) or as tightly coupled memory (TCM), that is, static memory (PMEM 102 and DMEM 103, respectively). At the top level, a von Neumann architecture is implemented, in which the second-level cache L2$ 101 serves requests from the first-level program and data caches and accesses the external system memory through an external interface.

Instructions read from the program memory PMEM/L1I$ 102 by the fetch unit FETCH 109 are passed to the DECODE 110 unit, which decodes them and generates control signals for the execution units of the processor.

Instructions are organized as VLIW (Very Long Instruction Word) packets, each containing several simultaneously executed commands for both the scalar and vector channels 105, 107 of the processor. Fig. 2 shows the structure of the VLIW packet 201. Each command in the VLIW packet 201 has a dedicated position called a slot. The VLIW packet has four slots 202-205 for commands of the scalar channel 105 and four slots 206-209 for commands of the vector channel 107, so up to eight commands can be executed simultaneously. Each of the eight slots 202-209 of the VLIW packet 201 can hold commands of a particular type, intended for the corresponding set of execution units 210-217. The composition and total number of commands may differ from one VLIW packet to another.
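The slot layout of the VLIW packet can be sketched as a simple data structure. This is a minimal model under stated assumptions: the class name `VliwPacket`, the use of `None` for an empty slot, and the mnemonics in the usage example are hypothetical, and the mapping of slots to the execution unit sets 210-217 is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SCALAR_SLOTS = 4   # slots 202-205, scalar channel 105
VECTOR_SLOTS = 4   # slots 206-209, vector channel 107

@dataclass
class VliwPacket:
    """One VLIW packet 201; None marks a slot left empty in this packet."""
    scalar: List[Optional[str]] = field(
        default_factory=lambda: [None] * SCALAR_SLOTS)
    vector: List[Optional[str]] = field(
        default_factory=lambda: [None] * VECTOR_SLOTS)

    def __post_init__(self) -> None:
        if len(self.scalar) != SCALAR_SLOTS or len(self.vector) != VECTOR_SLOTS:
            raise ValueError("a packet has four scalar and four vector slots")

    def issued(self) -> int:
        """Number of commands that execute simultaneously (at most eight)."""
        return sum(cmd is not None for cmd in self.scalar + self.vector)

# Hypothetical mnemonics, for illustration only.
packet = VliwPacket(scalar=["ld", "add", None, None],
                    vector=["vld", "vmac", "vadd", None])
print(packet.issued())   # 5
```

Leaving slots empty reflects the statement that the composition and total number of commands can vary from packet to packet.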

Program control commands, which include program branch commands and program loop commands, are executed by the program control unit PCTRL 111.

Table 1 lists the basic set of program control commands executed by the PCTRL 111 unit.

The commands of the scalar channel 105 of the processor comprise scalar data memory access commands (SLSE0 310, SLSE1 311) and commands for scalar computational operations: fixed-point arithmetic-logic operations (ALU0 302, ALU1 303, ALU2 304, ALU3 305), floating-point arithmetic-logic operations (FALU0 306, FALU1 307), fixed- and floating-point multiplication (SMU0 308, SMU1 309), shift (SH 312), type conversion (CONV 315), division (DIV 313), and computation of transcendental mathematical functions (MF 314).

The commands of the vector channel 107 of the processor comprise vector data memory access commands (VLSE0 412, VLSE1 413) and commands for vector computational operations: fixed-point arithmetic-logic operations (VALU0 403, VALU1 404, VALU2 405, VALU3 406), floating-point arithmetic-logic operations (VFALU0 407, VFALU1 408), fixed- and floating-point multiplication (VMU0 409, VMU1 410), shift (VSH 411), type conversion (VCONV 414), and reduction operations (VRED 104).

Taken together, the program control commands, the commands of the execution units of the scalar and vector channels 105, 107 of the processor, and the commands of the reduction unit 104 form the complete command set of the processor.

The structure of the scalar computing section 106 of the scalar channel 105 of the processor is shown in Fig. 3. The central element of the scalar computing section 106 is the multi-port scalar register file RF 301, which stores the scalar data being processed. Data is exchanged between the data memory DMEM/L1D$ 103 and the scalar channel 105 through the ports of RF 301 that are connected to the external interface of the scalar channel 105, using the scalar data memory access controllers SLSE0 310 and SLSE1 311. Through the ports of RF 301 connected to the execution units of the scalar computing section 106, source operands are supplied for the computational operations and their results are written back. The execution units of the scalar computing section 106 are: four fixed-point arithmetic logic units ALU0 302, ALU1 303, ALU2 304, ALU3 305; two floating-point arithmetic logic units FALU0 306, FALU1 307; two fixed- and floating-point multiplier units SMU0 308, SMU1 309; the shift unit SH 312; the type conversion unit CONV 315; the divider unit DIV 313; and the transcendental mathematical function unit MF 314. Each of these units performs its own set of scalar computational operations. With the VLIW packet structure described above, up to four scalar computational operations can be performed simultaneously, including two scalar data exchanges with the first-level data memory DMEM/L1D$ 103.

Table 2 lists the basic set of scalar data memory 103 access commands executed by the units SLSE0 310 and SLSE1 311.

Table 3 lists the basic set of scalar fixed-point arithmetic-logic commands executed by the units ALU0 302, ALU1 303, ALU2 304, and ALU3 305.

Table 4 lists the basic set of scalar floating-point arithmetic-logic commands executed by the units FALU0 306 and FALU1 307.

Table 5 lists the basic set of scalar fixed- and floating-point multiplication commands executed by the units SMU0 308 and SMU1 309.

Table 6 lists the basic set of scalar shift commands executed by the SH 312 unit.

Table 7 lists the basic set of scalar type conversion commands executed by the CONV 315 unit.

Table 8 lists the basic set of scalar division commands executed by the DIV 313 unit.

Table 9 lists the basic set of scalar commands for computing transcendental mathematical functions, executed by the MF 314 unit.

Tables 1-20 use the following notation:

R, Ri, Ra, Rt, Rs, Rd - scalar data registers;

V, Vi, Va, Vt, Vs, Vd - vector data registers;

VAi - vector accumulator registers;

.b, .h, .l, .d - data format specifiers:

.b - byte (8 bits);

.h - halfword (16 bits);

.l - long (32 bits);

.d - double (64 bits);

i8, i16, i32, i64 - signed integer 8/16/32/64-bit formats;

u8, u16, u32, u64 - unsigned integer 8/16/32/64-bit formats;

#imm - immediate value;

#N - N-bit immediate value;

trunk_N - truncation to N bits;

zext_N→M - extension of a number from N to M bits by filling the missing high-order bits with zeros;

sext_N→M - extension of a number from N to M bits by filling the missing high-order bits with the sign bit;

{,} - concatenation of several operands;

T[i], S[i], D[i], V[i] - vector elements.
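The width-changing notation above can be made concrete with a few bit-level helpers. This is a minimal sketch on non-negative Python integers treated as bit patterns; the function signatures are hypothetical, and the operand order of the concatenation (first operand in the high-order bits) is an assumption that the notation list does not state.

```python
def trunk(x: int, n: int) -> int:
    """trunk_N: keep only the N low-order bits."""
    return x & ((1 << n) - 1)

def zext(x: int, n: int, m: int) -> int:
    """zext_N->M: the new high-order bits are zeros, so the value is unchanged."""
    assert m >= n
    return x & ((1 << n) - 1)

def sext(x: int, n: int, m: int) -> int:
    """sext_N->M: the new high-order bits copy the sign bit (bit N-1)."""
    assert m >= n
    x &= (1 << n) - 1
    if (x >> (n - 1)) & 1:
        x |= ((1 << m) - 1) ^ ((1 << n) - 1)   # set bits N..M-1
    return x

def concat(hi: int, lo: int, lo_bits: int) -> int:
    """{hi, lo}: hi is placed above the lo_bits-wide low operand (assumed order)."""
    return (hi << lo_bits) | (lo & ((1 << lo_bits) - 1))

print(hex(trunk(0x1234, 8)))       # 0x34
print(hex(zext(0x80, 8, 16)))      # 0x80
print(hex(sext(0x80, 8, 16)))      # 0xff80
print(hex(concat(0x12, 0x34, 8)))  # 0x1234
```

The byte 0x80 shows the difference between the two extensions: read as u8 it is 128 and zero-extends to 0x0080, while read as i8 it is -128 and sign-extends to 0xFF80.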

The vector channel 107 of the processor includes several vector computing sections 108, the total number of which is determined by the width of the vector being processed. The structure of a vector computing section 108 of the vector channel 107 is shown in Fig. 4. It includes the multi-port, multi-format vector register file VRF 401, which stores the vector data being processed. Data is loaded from and stored to the data memory DMEM/L1D$ 103 through the ports of VRF 401 that are connected to the external interface of the vector channel 107, using the vector data memory access controllers VLSE0 412 and VLSE1 413. Through the ports of VRF 401 connected to the execution units of the vector computing section 108, vector data is supplied to the execution units and the results are written back to the vector register file VRF 401.

A feature of the vector register file VRF 401 is its ability to work with various data formats, as shown in FIG. 5. Each 64-bit register 500 of the vector register file VRF 401 can store one 64-bit value 501, two 32-bit values 502, four 16-bit values 503, or eight 8-bit values 504.
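The multi-format view of a 64-bit register can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the assumption that lane 0 occupies the least significant bits is not stated in the source.

```python
def unpack_lanes(reg64, lane_bits):
    """Split a 64-bit register value into 64 // lane_bits unsigned lanes."""
    assert lane_bits in (8, 16, 32, 64)
    mask = (1 << lane_bits) - 1
    # Assumption: lane 0 is the least significant lane.
    return [(reg64 >> (i * lane_bits)) & mask for i in range(64 // lane_bits)]


def pack_lanes(lanes, lane_bits):
    """Inverse of unpack_lanes: assemble lanes into one 64-bit value."""
    assert lane_bits in (8, 16, 32, 64) and len(lanes) == 64 // lane_bits
    mask = (1 << lane_bits) - 1
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & mask) << (i * lane_bits)
    return reg


reg = 0x1122334455667788
for bits in (64, 32, 16, 8):
    print(bits, [hex(x) for x in unpack_lanes(reg, bits)])
```

The same 64 bits are read as one, two, four, or eight elements; only the interpretation changes, not the stored value.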

The vector computing section 108 of the vector channel 107 also includes a vector register file of accumulator registers VAC 402, which stores the data produced and consumed by the multiply-accumulate operations performed by the two vector multiplier units VMU0 409 and VMU1 410.
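A lane-wise multiply-accumulate of the kind the accumulator registers serve can be sketched as below. The function name vmac is hypothetical, and the accumulator width is not modelled: Python integers are unbounded, whereas the width of the VAC 402 registers is not given in this passage.

```python
def vmac(acc, a, b):
    """Return acc[i] + a[i] * b[i] for every lane (assumed semantics)."""
    assert len(acc) == len(a) == len(b)
    return [c + x * y for c, x, y in zip(acc, a, b)]


# Accumulate two products per lane, as in a short dot-product loop.
acc = [0, 0, 0, 0]
acc = vmac(acc, [1, 2, 3, 4], [5, 6, 7, 8])
acc = vmac(acc, [1, 1, 1, 1], [1, 1, 1, 1])
print(acc)
```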

The execution units of the computing section 108 of the vector channel 107 of the processor also include: four fixed-point vector arithmetic logic units VALU0 403, VALU1 404, VALU2 405, VALU3 406; two floating-point vector arithmetic logic units VFALU0 407, VFALU1 408; a vector shift unit VSH 411; a type conversion unit VCONV 414; and a reduction function unit VRED 104. Each of these units performs its own set of vector computing operations. With the specified VLIW packet structure, up to four vector computing operations are performed simultaneously, including two vector data exchange operations with the DMEM/L1D$ 103 memory.

Table 10 lists the basic set of vector data memory access instructions executed by the VLSE0 412 and VLSE1 413 units.

Table 11 lists the basic set of fixed-point vector arithmetic-logic instructions executed by the VALU0 403, VALU1 404, VALU2 405, and VALU3 406 units.

Table 12 lists the basic set of floating-point vector arithmetic-logic instructions executed by the VFALU0 407 and VFALU1 408 units.

Table 13 lists the basic set of fixed-point and floating-point vector multiplication instructions executed by the VMU0 409 and VMU1 410 units.

Table 14 lists the basic set of vector shift instructions executed by the VSH 411 unit.

Table 15 lists the basic set of vector type conversion instructions executed by the VCONV 414 unit.

The most important feature of the scalar-vector processor architecture under consideration is the reduction unit VRED 104, which connects the scalar and vector channels 105, 107 of the processor and is intended both for organizing exchanges between them and for performing operations that use and/or produce scalar and vector data at the same time. The block diagram of the reduction unit VRED 104 is shown in FIG. 6. It contains the following execution units: the inter-section arithmetic-logic reduction unit RALU 601, the inter-section permutation unit SHUFFLE 602, the inter-section table lookup unit LUT 603, and the histogram calculation unit HIST 604.
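The four kinds of whole-vector operation named here can be illustrated generically in Python. This sketch shows only the textbook meaning of reduction, shuffle, table lookup, and histogram; the function names and signatures are hypothetical and do not reproduce the actual instruction sets of RALU 601, SHUFFLE 602, LUT 603, or HIST 604.

```python
def reduce_sum(v):
    """Reduction: a whole vector collapses into one scalar."""
    return sum(v)


def shuffle(v, idx):
    """Permutation: output element i is taken from input position idx[i]."""
    return [v[i] for i in idx]


def lut(v, table):
    """Table lookup: every element is replaced by table[element]."""
    return [table[x] for x in v]


def histogram(v, bins):
    """Histogram: count how many elements fall into each bin 0..bins-1."""
    counts = [0] * bins
    for x in v:
        counts[x] += 1
    return counts


v = [3, 1, 0, 1]
print(reduce_sum(v), shuffle(v, [3, 2, 1, 0]), lut(v, [10, 11, 12, 13]), histogram(v, 4))
```

Reduction produces a scalar from vector data, which is the direction in which the scalar channel consumes a value formed by the vector channel.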

Table 16 lists the basic set of instructions of the inter-section arithmetic-logic reduction unit RALU 601, which is part of the reduction unit VRED 104.

Table 17 lists the basic set of instructions of the inter-section permutation unit SHUFFLE 602, which is part of the reduction unit VRED 104.

Table 18 lists the basic set of instructions of the inter-section table lookup unit LUT 603, which is part of the reduction unit VRED 104.

Table 19 lists the basic set of instructions of the histogram calculation unit HIST 604, which is part of the reduction unit VRED 104.

Another mechanism that links the scalar and vector channels of the processor is the ring bus CDB (Circular Data Bus) 112, shown in FIG. 7.

One of the instructions executed using the ring bus CDB 112 is the cyclic shift instruction VPUSHRD Ri, Vj. As a result of its execution, the scalar register Ri is moved to the vector register Vj of the zeroth vector computing section 108: Vj.0=Ri; the vector register Vj of the highest vector computing section 108 is moved to the scalar register Ri: Ri=Vj.N; and in the remaining vector computing sections 108 the vector register data is shifted by one section toward the higher sections: Vj.k=Vj.k-1, k=1,2,…,N. This mechanism can be used, for example, to move data sequentially from the vector channel 107 of the processor to the scalar channel 105 in order to perform specific operations that are available only in the scalar channel 105 (for example, computing transcendental mathematical functions), and then to return the transformed data to the vector channel 107 of the processor.
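The VPUSHRD data movement can be modelled in a few lines of Python. The sketch follows the three assignments given above (Vj.0=Ri, Ri=Vj.N, Vj.k=Vj.k-1) and treats them as simultaneous; the Python function names and the helper map_via_scalar, which shows the "process in the scalar channel and return" use case, are assumptions for illustration.

```python
def vpushrd(ri, vj):
    """One ring shift: returns (new_ri, new_vj).

    vj[k] models register Vj in section k, k = 0..N.
    """
    new_ri = vj[-1]              # Ri = Vj.N
    new_vj = [ri] + vj[:-1]      # Vj.0 = Ri, Vj.k = Vj.k-1
    return new_ri, new_vj


def map_via_scalar(vj, f, ri=0):
    """Pass every element through scalar function f using only ring shifts."""
    for _ in range(len(vj)):
        ri, vj = vpushrd(ri, vj)  # pull the highest section into Ri
        ri = f(ri)                # scalar-only operation on Ri
    ri, vj = vpushrd(ri, vj)      # final shift pushes the last result back
    return ri, vj


print(map_via_scalar([1, 2, 3, 4], lambda x: x * x, ri=99))
```

After len(vj) + 1 shifts the vector holds the transformed elements in their original order, and Ri holds its initial value again.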

Table 20 lists the basic set of inter-section scalar-vector shift instructions executed using the ring data bus CDB 112.

Thus, compared with previously known scalar-vector processor architectures, the architecture under consideration offers considerably broader capabilities for organizing efficient interaction between the scalar and vector channels 105, 107 of the processor, thereby providing higher performance of scalar-vector computations.

The set and functionality of the execution units of the scalar and vector channels 105, 107 of the processor can be extended or reduced depending on the application area of the processor.

Although the embodiment of the claimed invention described above has been set forth for purposes of illustration, it will be clear to those skilled in the art that various modifications, additions, and substitutions are possible without departing from the scope and spirit of the claimed invention as disclosed in the appended claims.

| Command mnemonic | Description |
|---|---|
| B #16; B #32; B Ra.L | Program jump to relative address #16, #32, Ra.L |
| J #16; J #32; J Ra.L | Program jump to absolute address #16, #32, Ra.L |
| BS #16, Ri.L; BS #32, Ri.L; BS Ra.L, Ri.L | Program jump to relative address #16, #32, Ra.L, saving the return address in register Ri.L (all 32 RF registers) |
| JS #16, Ri.L; JS #32, Ri.L; JS Ra.L, Ri.L | Program jump to absolute address #16, #32, Ra.L, saving the return address in register Ri.L (all 32 RF registers) |
| DO #4, #16; DO #16, #32 | Loop start with relative end address #16, #32 and an immediate repetition count |
| DO R.L, #16; DO R.L, #32 | Loop start with relative end address #16, #32 and a repetition count taken from register R.L |
| ENDDO | Conditional loop stop. The loop is interrupted immediately and execution continues at the instruction that follows the last instruction of the loop. |

Table 1.

| Command mnemonic | Description |
|---|---|
| LDB (A), R | Load i8 from memory, sign-extend to i32 |
| LDBU (A), R | Load u8 from memory, zero-extend to u32 |
| LDH (A), R | Load i16 from memory, sign-extend to i32 |
| LDHU (A), R | Load u16 from memory, zero-extend to u32 |
| LDL (A), R | Load an i32 word from memory (no conversion) |
| LDD (A), R | Load an i64 double word from memory (no conversion) |
| STB R, (A) | Store u8 to memory |
| STH R, (A) | Store u16 to memory |
| STL R, (A) | Store u32 to memory |
| STD R, (A) | Store u64 to memory |

Table 2.

| Command mnemonic | Formula | Description |
|---|---|---|
| Scalar integer addition/subtraction operations | | |
| ADDD Rt, Rs, Rd; ADDD #imm, Rs, Rd; ADDD.SAT Rt, Rs, Rd | Rd.d = (Rt.d(#imm) + Rs.d); Rd.d = sat64(Rt.d + Rs.d) | Addition of two operands, signed, with optional saturation, i64+i64→i64 |
| ADDL Rt, Rs, Rd; ADDL #imm, Rs, Rd; ADDL.SAT Rt, Rs, Rd | Rd = (Rt(#imm) + Rs); Rd = sat32(Rt + Rs) | Addition of two operands, signed, with optional saturation, i32+i32→i32 |
| ADDL.SCL Rt, Rs, Rd; ADDL.SCL.RND Rt, Rs, Rd | Rd = (Rt + Rs)>>1; Rd = (Rt + Rs + 1)>>1 | Addition with shift, signed, with optional rounding, i32+i32→i32 |
| ADDL.RND Rt, Rs, Rd | Rd.l = (Rt.l + Rs.l).rnd | Addition with rounding, i32+i32→i32. The lower 16 bits are rounded. |
| SUBD Rt, Rs, Rd; SUBD #imm, Rs, Rd; SUBD.SAT Rt, Rs, Rd | Rd.d = (Rs.d - Rt.d(#imm)); Rd.d = sat64(Rs.d - Rt.d) | Subtraction of two operands, signed, with optional saturation, i64 - i64 → i64 |
| SUBL.SCL Rt, Rs, Rd; SUBL.SCL.RND Rt, Rs, Rd | Rd = (Rs - Rt)>>1; Rd = (Rs - Rt - 1)>>1 | Subtraction with shift, signed, with optional rounding, i32 - i32 → i32 |
| SUBL.RND Rt, Rs, Rd | Rd.l = (Rs.l - Rt.l).rnd | Subtraction with rounding, i32 - i32 → i32. The lower 16 bits are rounded. |
| NEGD Rs, Rd; NEGD.SAT Rs, Rd | Rd = -Rs; Rd = sat64(-Rs) | Negation with optional saturation of the result, i64 |
| NEGL Rs, Rd; NEGL.SAT Rs, Rd | Rd = -Rs; Rd = sat32(-Rs) | Negation with optional saturation of the result, i32 |
| Scalar absolute value operations | | |
| ABSD Rs, Rd; ABSD.SAT Rs, Rd | Rd = \|Rs\|; Rd = sat64(\|Rs\|) | Absolute value of a 64-bit signed integer, with optional saturation |
| ABSL Rs, Rd; ABSL.SAT Rs, Rd | Rd = \|Rs\|; Rd = sat32(\|Rs\|) | Absolute value of a 32-bit signed integer, with optional saturation |
| Scalar integer maximum/minimum operations | | |
| MAXD Rt, Rs, Rd | Rd = max(Rt, Rs) | Maximum of two elements, i64 → i64 |
| MAXDU Rt, Rs, Rd | Rd = maxu(Rt, Rs) | Maximum of two elements, u64 → u64 |
| MAXMD Rt, Rs, Rd | Rd = maxm(Rt, Rs) | Selection of the number with the larger absolute value, i64 → i64 |
| MIND Rt, Rs, Rd | Rd = min(Rt, Rs) | Minimum of two elements, i64 → i64 |
| MINDU Rt, Rs, Rd | Rd = minu(Rt, Rs) | Minimum of two elements, u64 → u64 |
| MINMD Rt, Rs, Rd | Rd = minm(Rt, Rs) | Selection of the number with the smaller absolute value, i64 → i64 |
| MAXL Rt, Rs, Rd | Rd = max(Rt, Rs) | Maximum of two elements, i32 → i32 |
| MAXLU Rt, Rs, Rd | Rd = maxu(Rt, Rs) | Maximum of two elements, u32 → u32 |
| MAXML Rt, Rs, Rd | Rd = maxm(Rt, Rs) | Selection of the number with the larger absolute value, i32 → i32 |
| MINL Rt, Rs, Rd | Rd = min(Rt, Rs) | Minimum of two elements, i32 → i32 |
| MINLU Rt, Rs, Rd | Rd = minu(Rt, Rs) | Minimum of two elements, u32 → u32 |
| MINML Rt, Rs, Rd | Rd = minm(Rt, Rs) | Selection of the number with the smaller absolute value, i32 → i32 |
| Scalar logical operations | | |
| ANDD Rt/#imm, Rs, Rd | Rd = Rt(#imm) & Rs | Bitwise logical AND |
| ORD Rt/#imm, Rs, Rd | Rd = Rt(#imm) \| Rs | Bitwise logical OR |
| EORD Rt, Rs, Rd | Rd = Rt ^ Rs | Bitwise logical exclusive OR |
| INSD Rt, Rs, Rd | Rd = (~Rt & Rs) \| (Rt & Rd) | Merge by mask |
| ANDL Rt/#imm, Rs, Rd | Rd = Rt(#imm) & Rs | Bitwise logical AND |
| ORL Rt/#imm, Rs, Rd | Rd = Rt(#imm) \| Rs | Bitwise logical OR |
| EORL Rt, Rs, Rd | Rd = Rt ^ Rs | Bitwise logical exclusive OR |
| INSL Rt, Rs, Rd | Rd = (~Rt & Rs) \| (Rt & Rd) | Merge by mask |
| NOTL Rs, Rd | Rd = ~Rs | Bitwise negation |

Table 3.
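Several of the formulas in Table 3 can be checked with a small Python model. This is a sketch under assumptions: the function names are hypothetical, registers are modelled as Python integers, the tie-breaking rule of maxm for equal absolute values is chosen arbitrarily, and the saturating addition uses Rt + Rs as in the non-saturating form.

```python
def sat(x, bits):
    """Clamp x to the signed range of the given width (models satN)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def addl_sat(rt, rs):
    """ADDL.SAT: Rd = sat32(Rt + Rs)."""
    return sat(rt + rs, 32)


def addl_scl(rt, rs, rnd=False):
    """ADDL.SCL / ADDL.SCL.RND: Rd = (Rt + Rs [+ 1]) >> 1."""
    return (rt + rs + (1 if rnd else 0)) >> 1


def maxm(rt, rs):
    """MAXML-style selection of the operand with the larger absolute value."""
    return rt if abs(rt) >= abs(rs) else rs  # tie rule is an assumption


def ins(rt, rs, rd, bits=32):
    """INSL-style merge by mask: Rd = (~Rt & Rs) | (Rt & Rd)."""
    mask = (1 << bits) - 1
    return ((~rt & rs) | (rt & rd)) & mask


print(addl_sat(2**31 - 1, 1), addl_scl(3, 4), addl_scl(3, 4, rnd=True))
print(maxm(-5, 3), hex(ins(0xFF00, 0x1234, 0xABCD)))
```

In the merge-by-mask model, bits where the mask Rt is 1 keep the old contents of Rd, and bits where it is 0 are taken from Rs.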

| Command mnemonic | Formula | Description |
|---|---|---|
| Scalar floating-point addition/subtraction operations | | |
| FADD Rt, Rs, Rd | Rd = Rt + Rs | Addition of two numbers, f32 + f32 → f32 |
| DADD Rt, Rs, Rd | Rd = Rt + Rs | Addition of two numbers, f64 + f64 → f64 |
| FSUB Rt, Rs, Rd | Rd = Rs - Rt | Subtraction of two numbers, f32 - f32 → f32 |
| DSUB Rt, Rs, Rd | Rd = Rs - Rt | Subtraction of two numbers, f64 - f64 → f64 |
| Scalar floating-point maximum/minimum operations | | |
| FMAX Rt, Rs, Rd | Rd = max(Rt, Rs) | Maximum of two numbers, f32 |
| DMAX Rt, Rs, Rd | Rd = max(Rt, Rs) | Maximum of two numbers, f64 |
| FMIN Rt, Rs, Rd | Rd = min(Rt, Rs) | Minimum of two numbers, f32 |
| DMIN Rt, Rs, Rd | Rd.d = min(Rt, Rs) | Minimum of two numbers, f64 |

Table 4.

| Command mnemonic | Formula | Description |
|---|---|---|
| Scalar integer multiplication operations | | |
| MPYLLO Rt, Rs, Rd; MPYLLO #imm, Rs, Rd | Rd = trunk₃₂(Rs * Rt); Rd = trunk₃₂(Rs * #imm) | Multiplication i32*i32→i64, using the lower 32 bits, i32 |
| MPYLULO Rt, Rs, Rd; MPYLULO #imm, Rs, Rd | Rd = trunk₃₂(Rs * Rt); Rd = trunk₃₂(Rs * #imm) | Multiplication u32*u32→u64, using the lower 32 bits, u32 |
| MPYLHI Rt, Rs, Rd; MPYLHI #imm, Rs, Rd; MPYLHI.RND Rt, Rs, Rd | Rd = (Rs * Rt)>>32; Rd = (Rs * #imm)>>32; Rd = (Rs * Rt + 0x8000_0000)>>32 | Multiplication, using the upper 32 bits, i32*i32→i64, with optional rounding |
| MPYLUHI Rt, Rs, Rd; MPYLUHI #imm, Rs, Rd | Rd = (Rs * Rt)>>32; Rd = (Rs * #imm)>>32 | Multiplication, using the upper 32 bits, u32*u32→u32 |
| MPYL Rt, Rs, Rdd; MPYL #imm, Rs, Rdd | Rdd = Rs * Rt; Rdd = Rs * #imm | Multiplication, full result written to a register pair, i32*i32→i64 |
| MPYLU Rt, Rs, Rdd; MPYLU #imm, Rs, Rdd | Rdd = Rs * Rt; Rdd = Rs * #imm | Multiplication, full result written to a register pair, u32*u32→u64 |
| MPYDLO Rt, Rs, Rd | Rd = trunk₆₄(Rs * Rt) | Multiplication i64*i64→i128, using the lower 64 bits, i64 |
| MPYDHI Rt, Rs, Rd | Rd = (Rs * Rt)>>64 | Multiplication, using the upper 64 bits, i64*i64→i128 |
| MPYDULO Rt, Rs, Rd | Rd = trunk₆₄(Rs * Rt) | Multiplication u64*u64→u128, using the lower 64 bits, u64 |
| MPYDUHI Rt, Rs, Rd | Rd = (Rs * Rt)>>64 | Multiplication, using the upper 64 bits, u64*u64→u128 |
| Scalar floating-point multiplication operations | | |
| FMPY Rt, Rs, Rd | Rd = Rt*Rs | Multiplication of two numbers, f32*f32 → f32 |
| DMPY Rt, Rs, Rd | Rd.d = Rt.d*Rs.d | Multiplication of two numbers, f64*f64 → f64 |
| FMADD Rt, Rs, Rd; FMADD Rt, Rs, Rr, Rd | Rd = Rd + Rt*Rs; Rd = Rr + Rt*Rs | Addition of a product, f32*f32+f32→f32 |
| FMSUB Rs, Rt, Rd | Rd = Rd - Rt*Rs | Subtraction of a product, f32*f32-f32→f32 |

Table 5.
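The low-part and high-part 32-bit multiplications in Table 5 can be illustrated as follows. The Python names are hypothetical, and the rounding variant is modelled as adding 0x8000_0000 to the full product of the two operands before the shift, which is an interpretation of the table rather than a confirmed definition.

```python
def to_signed(x, bits):
    """Reinterpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def mpyllo(rs, rt):
    """MPYLLO: lower 32 bits of the signed 64-bit product."""
    return to_signed(rs * rt, 32)


def mpylhi(rs, rt, rnd=False):
    """MPYLHI / MPYLHI.RND: upper 32 bits of the signed 64-bit product."""
    product = rs * rt
    if rnd:
        product += 0x8000_0000  # add half of the discarded range
    return to_signed(product >> 32, 32)


print(mpyllo(-1, 5), mpylhi(1 << 30, 4))
print(mpylhi(0x4000_0000, 3), mpylhi(0x4000_0000, 3, rnd=True))
```

The last line shows the effect of rounding: a product of 0xC000_0000 truncates to 0 in the upper word but rounds up to 1.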

| Command mnemonic | Formula | Description |
|---|---|---|
| Scalar shift operations | | |
| ASRD Rt, Rs, Rd; ASRD #5u, Rs, Rd; ASRD1 #5u, Rs, Rd | Rd.d = Rs.d >> Rt; Rd.d = Rs.d >> #5u; Rd.d = Rs.d >> (#5u+32) | Arithmetic right shift |
| LSLD Rt, Rs, Rd; LSLD #5u, Rs, Rd; LSLD1 #5u, Rs, Rd | Rd.d = Rs.d <<< Rt; Rd.d = Rs.d <<< #5u; Rd.d = Rs.d <<< (#5u+32) | Logical left shift |
| LSRD Rt, Rs, Rd; LSRD #u5, Rs, Rd; LSRD1 #u5, Rs, Rd | Rd.d = Rs.d >>> Rt; Rd.d = Rs.d >>> #5u; Rd.d = Rs.d >>> (#5u+32) | Logical right shift |
| LSLD.SAT Rt(#u5), Rs, Rd; LSLD1.SAT #u5, Rs, Rd | Rd = sat64(Rs <<< (#5u)); Rd = sat64(Rs <<< (#5u+32)) | Logical left shift with saturation. If any set bit is shifted out of the bit grid, the result is set to the maximum unsigned value. |
| ROLD Rt/#5u, Rs, Rd | Rd = (Rs <<< Rt) \| (Rs >>> (64-Rt)) | Rotate left, 64-bit grid. Normal behavior when #5 = 0: result Rd = Rs. |
| RORD Rt/#5u, Rs, Rd | Rd = (Rs >>> Rt) \| (Rs <<< (64-Rt)) | Rotate right, 64-bit grid. Special behavior when #5 = 0: result Rd = {Rs.L[0], Rs.L[1]} (the two 32-bit words are swapped). |
| ASRL Rt(#u5), Rs, Rd | Rd = Rs >> Rt | Arithmetic right shift |
| LSLL Rt(#u5), Rs, Rd | Rd = Rs <<< Rt | Logical left shift |
| LSRL Rt(#u5), Rs, Rd | Rd = Rs >>> Rt | Logical right shift |
| LSLL.SAT Rt(#u5), Rs, Rd | Rd = sat32(Rs <<< Rt) | Logical left shift with saturation. If any set bit is shifted out of the bit grid, the result is set to the maximum unsigned value. |
| ROLL Rt, Rs, Rd; ROLL #5u, Rs, Rd | Rd = (Rs <<< Rt) \| (Rs >>> (32-Rt)) | Rotate left, 32-bit grid |
| RORL Rt, Rs, Rd; RORL #5u, Rs, Rd | Rd = (Rs >>> Rt) \| (Rs <<< (32-Rt)) | Rotate right, 32-bit grid |

Table 6.
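The 32-bit rotate and saturating-shift formulas of Table 6 can be sketched in Python as below. The function names are hypothetical, the shift amount is reduced modulo the register width as an assumption, and the special zero-shift behavior described for RORD is not modelled.

```python
def roll(rs, n, bits=32):
    """ROLL: Rd = (Rs <<< n) | (Rs >>> (bits - n)), kept within `bits` bits."""
    mask = (1 << bits) - 1
    rs &= mask
    n %= bits  # assumption: the shift amount wraps at the register width
    return ((rs << n) | (rs >> (bits - n))) & mask


def rorl(rs, n, bits=32):
    """RORL: Rd = (Rs >>> n) | (Rs <<< (bits - n)), kept within `bits` bits."""
    mask = (1 << bits) - 1
    rs &= mask
    n %= bits
    return ((rs >> n) | (rs << (bits - n))) & mask


def lsll_sat(rs, n, bits=32):
    """LSLL.SAT: if any set bit leaves the bit grid, return the unsigned maximum."""
    mask = (1 << bits) - 1
    shifted = (rs & mask) << n
    return mask if shifted > mask else shifted


print(hex(roll(0x80000001, 1)), hex(rorl(0x00000003, 1)))
print(hex(lsll_sat(0x40000000, 1)), hex(lsll_sat(0x80000000, 1)))
```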

Мнемоника командыCommand mnemonic ФормулаFormula ОписаниеDescription Скалярные операции расширения целочисленных типовScalar Extension Operations of Integer Types CVBLU Rt, Rd
CVBL Rt, RdCVBLU Rt, Rd
CVBL Rt, Rd Rd.l = zext_8→32(Rt.b)
Rd.l = sext_8→32(Rt.b)Rd.l = zext _8→32 (Rt.b)
Rd.l = sext _8→32 (Rt.b) Расширение байта до целого, i8 → i32, знаковое и беззнаковоеByte to integer expansion, i8 → i32, signed and unsigned CVHLU Rt, Rd
CVHL Rt, RdCVHLU Rt, Rd
CVHL Rt, Rd Rd.l = zext_16→32(Rt.h)
Rd.l = sext_16→32(Rt.h)Rd.l = zext _16→32 (Rt.h)
Rd.l = sext _16→32 (Rt.h) Расширение короткого целого до целого, i16 → i32, знаковое и беззнаковоеShort integer to integer expansion, i16 → i32, signed and unsigned CVBDU Rt, Rs
CVBD Rt, RdCVBDU Rt, Rs
CVBD Rt, Rd Rd.d= zext_8→64(Rt.b)
Rd.d = sext_8→64(Rt.b)Rd.d= zext _8→64 (Rt.b)
Rd.d = sext _8→64 (Rt.b) Расширение байта до целого, i8 → i64, знаковое и беззнаковоеByte to integer expansion, i8 → i64, signed and unsigned CVHDU Rt, Rs
CVHD Rt, RdCVHDU Rt, Rs
CVHD Rt, Rd Rd.d= zext_16→64(Rt.h)
Rd.d = sext_16→64(Rt.h)Rd.d= zext _16→64 (Rt.h)
Rd.d = sext _16→64 (Rt.h) Расширение байта до целого, i16 → i64, знаковое и беззнаковоеByte to integer expansion, i16 → i64, signed and unsigned CVLDU Rt, Rs
CVLD Rt, RdCVLDU Rt, Rs
CVLD Rt, Rd Rd.d= zext_32→64(Rt.l)
Rd.d = sext_32→64(Rt.l)Rd.d= zext _32→64 (Rt.l)
Rd.d = sext _32→64 (Rt.l) Расширение байта до целого, i32 → i64, знаковое и беззнаковоеByte to integer expansion, i32 → i64, signed and unsigned Скалярные операции усечения целочисленных типовScalar Truncation of Integer Types CVLB Rt, Rd
CVLB.sat Rt, Rd
CVLBU.sat Rt, Rd
CVLH Rt, Rd
CVLH.sat Rt, Rd
CVLHU.sat Rt, Rd
CVDB Rt, Rd
CVDB.sat Rt, Rd
CVDBU.sat Rt, Rd
CVDH Rt, Rd
CVDH.sat Rt, Rd
CVDHU.sat Rt, Rd
CVDL Rt, Rd
CVDL.sat Rt, Rd
CVDLU.sat Rt, RdCVLB Rt, Rd
CVLB.sat Rt, Rd
CVLBU.sat Rt, Rd
CVLH Rt, Rd
CVLH.sat Rt, Rd
CVLHU.sat Rt, Rd
CVDB Rt, Rd
CVDB.sat Rt, Rd
CVDBU.sat Rt, Rd
CVDH Rt, Rd
CVDH.sat Rt, Rd
CVDHU.sat Rt, Rd
CVDL Rt, Rd
CVDL.sat Rt, Rd
CVDLU.sat Rt, Rd (на примере CVLB)
CVLB Rt, Rd Rd=sext_8→32(trunk8(Rt))
CVLB.sat Rt, Rd
Rd=sext_8→32 (sat8(Rt))
CVLBU.sat Rt, Rd
Rd=zext_8→32 (usat8(Rt))(on the example of CVLB)
CVLB Rt, Rd Rd=sext _8→32 (trunk8(Rt))
CVLB.sat Rt, Rd
Rd=sext _8→32 (sat8(Rt))
CVLBU.sat Rt, Rd
Rd=zext _8→32 (usat8(Rt)) (на примере CVLB)
Усечение исходных данных (L) до запрашиваемой величины (B), с опциональной сатурацией. Полученное число (B) расширяется своим знаком (для знаковых расширений) или нулем (для беззнаковых расширений - с суффиксом U) до исходной величины (L)(on the example of CVLB)
Truncation of raw data (L) to the requested value (B), with optional saturation. The resulting number (B) is expanded by its sign (for signed extensions) or zero (for unsigned extensions - with the U suffix) to the original value (L) Скалярные операции преобразования типов с плавающей запятойScalar floating-point type conversions FDCV Rt, RdFDCV Rt, Rd Rd = float_to_double(Rt)Rd = float_to_double(RT) Преобразование float32 в float64Convert float32 to float64 DFCV Rt, RdDFCV Rt, Rd Rd = double_to_float(Rt)Rd = double_to_float(RT) Преобразование float64 в float32Convert float64 to float32 FHCV Rt, RdFHCV Rt, Rd Rd[0].h = float_to_halffloat(Rt)
Rd[1].h = 0Rd[0].h = float_to_halffloat(Rt)
Rd[1].h = 0 Преобразование float32 в float16
Старшие 16 бит регистра Rd заполняются нулем.Convert float32 to float16
The upper 16 bits of register Rd are filled with zero. DHCV Rt, RdDHCV Rt, Rd Rd[0].h = double_to_halffloat(Rt)
Rd[1].h = 0Rd[0].h = double_to_halffloat(Rt)
Rd[1].h = 0 Преобразование float64 в float16
Convert float64 to float16. The upper 16 bits of register Rd are filled with zero.
| HFCV Rt, Rd | Rd = halffloat_to_float(Rt) | Convert float16 to float32 |
| HDCV Rt, Rd | Rd = halffloat_to_double(Rt) | Convert float16 to float64 |
| Scalar type conversion operations INT->FLT | | |
| CVDF Rt, Rd | Rd.f32 = int64_to_float32(Rt.d) | Convert int64 to float32 |
| CVDFU Rt, Rd | Rd.f32 = uint64_to_float32(Rt.d) | Convert uint64 to float32 |
| CVDD Rt, Rd | Rd.f64 = int64_to_float64(Rt.d) | Convert int64 to float64 |
| CVDDU Rt, Rd | Rd.f64 = uint64_to_float64(Rt.d) | Convert uint64 to float64 |
| CVIF Rt, Rd | Rd.f32 = int32_to_float32(Rt.d) | Convert int32 to float32 |
| CVIFU Rt, Rd | Rd.f32 = uint32_to_float32(Rt.d) | Convert uint32 to float32 |
| CVID Rt, Rd | Rd.f64 = int32_to_float64(Rt.d) | Convert int32 to float64 |
| CVIDU Rt, Rd | Rd.f64 = uint32_to_float64(Rt.d) | Convert uint32 to float64 |
| CVHF Rt, Rd | Rd.f32 = int16_to_float32(Rt.d) | Convert int16 to float32 |
| CVHFU Rt, Rd | Rd.f32 = uint16_to_float32(Rt.d) | Convert uint16 to float32 |
| Scalar type conversion operations FLT->INT | | |
| FCVD Rt, Rd | Rd.d = float32_to_int64(Rt.d) | Convert float32 to int64 |
| FCVDU Rt, Rd | Rd.ud = float32_to_uint64(Rt.d) | Convert float32 to uint64 |
| FCVI Rt, Rd; FCVI.floor Rt, Rd; FCVI.round Rt, Rd; FCVI.ceil Rt, Rd; FCVI.trunc Rt, Rd | Rd.l = float32_to_int32(Rt.d) | Convert float32 to int32. Optional rounding |
| FCVIU Rt, Rd; FCVIU.floor Rt, Rd; FCVIU.round Rt, Rd; FCVIU.ceil Rt, Rd; FCVIU.trunc Rt, Rd | Rd.ul = float32_to_uint32(Rt.d) | Convert float32 to uint32. Optional rounding |
| DCVD Rt, Rd; DCVD.floor Rt, Rd; DCVD.round Rt, Rd; DCVD.ceil Rt, Rd; DCVD.trunc Rt, Rd | Rd.d = float64_to_int64(Rt.d) | Convert float64 to int64. Optional rounding |
| DCVDU Rt, Rd; DCVDU.floor Rt, Rd; DCVDU.round Rt, Rd; DCVDU.ceil Rt, Rd; DCVDU.trunc Rt, Rd | Rd.ud = float64_to_uint64(Rt.d) | Convert float64 to uint64. Optional rounding |
| DCVI Rt, Rd | Rd.l = float64_to_int32(Rt.d) | Convert float64 to int32 |
| DCVIU Rt, Rd | Rd.ul = float64_to_uint32(Rt.d) | Convert float64 to uint32 |
| FCVH Rt, Rd | Rd.l = float32_to_int16(Rt.d) | Convert float32 to int16 |
| FCVHU Rt, Rd | Rd.ul = float32_to_uint16(Rt.d) | Convert float32 to uint16 |
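The rounding suffixes of the FLT->INT conversions (.floor, .round, .ceil, .trunc) can be illustrated with a short Python sketch. The function name `fcvi`, the choice of truncation for the suffix-less form, and round-half-to-even for `.round` are assumptions made for illustration only; the table does not state them, nor does it state the result for NaN or out-of-range inputs, which the sketch does not model.

```python
import math

def fcvi(x: float, mode: str = "trunc") -> int:
    """Hypothetical model of FCVI with an optional rounding suffix."""
    if mode == "floor":
        return math.floor(x)   # round toward minus infinity
    if mode == "ceil":
        return math.ceil(x)    # round toward plus infinity
    if mode == "trunc":
        return math.trunc(x)   # round toward zero
    if mode == "round":
        return round(x)        # assumption: ties go to the even integer
    raise ValueError(f"unknown rounding mode: {mode}")
```

For a negative input such as -2.7, the floor variant gives -3 while the trunc variant gives -2, which is the practical difference between the two suffixes.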

Table 7.

| Command mnemonic | Formula | Description |
| --- | --- | --- |
| Scalar integer division operations | | |
| DIVL | Rd.L = Rt.L/Rs.L | Division, i32: if (Rs == 0) { REM = 0; DIV = (Rt >= 0) ? 0x7FFFFFFF : 0x80000000 } else { DIV = Rt / Rs; REM = Rt % Rs }; Rd = DIV |
| REML | Rd.L = Rt.L%Rs.L | Remainder of division, i32: if (Rs == 0) { REM = 0; DIV = (Rt >= 0) ? 0x7FFFFFFF : 0x80000000 } else { DIV = Rt / Rs; REM = Rt % Rs }; Rd = REM |
| DIVREML | Rd.D = {Rt.L%Rs.L, Rt.L/Rs.L} | Remainder and quotient of division, i32: if (Rs == 0) { REM = 0; DIV = (Rt >= 0) ? 0x7FFFFFFF : 0x80000000 } else { DIV = Rt / Rs; REM = Rt % Rs }; Rd = {REM, DIV} |
| DIVLU | Rd.L = Rt.L/Rs.L | Division, u32: if (Rs == 0) { DIV = 0xFFFFFFFF; REM = 0 } else { DIV = Rt / Rs; REM = Rt % Rs }; Rd = DIV |
| REMLU | Rd.L = Rt.L%Rs.L | Remainder of division, u32: if (Rs == 0) { DIV = 0xFFFFFFFF; REM = 0 } else { DIV = Rt / Rs; REM = Rt % Rs }; Rd = REM |
| DIVREMLU | Rd.D = {Rt.L%Rs.L, Rt.L/Rs.L} | Remainder and quotient of division, u32: if (Rs == 0) { DIV = 0xFFFFFFFF; REM = 0 } else { DIV = Rt / Rs; REM = Rt % Rs }; Rd = {REM, DIV} |
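The divide-by-zero rule of Table 7 can be checked with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that the signed quotient truncates toward zero as in C; the table does not state the rounding direction, nor the result of dividing 0x80000000 by -1, which is not handled here.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def divrem_i32(rt: int, rs: int) -> tuple[int, int]:
    """Return (DIV, REM) for signed 32-bit operands, following Table 7."""
    if rs == 0:
        # 0x7FFFFFFF for a non-negative dividend, 0x80000000 otherwise; REM = 0
        return (INT32_MAX if rt >= 0 else INT32_MIN), 0
    # Assumption: quotient truncates toward zero (C semantics).
    q = abs(rt) // abs(rs)
    if (rt < 0) != (rs < 0):
        q = -q
    return q, rt - q * rs

def divrem_u32(rt: int, rs: int) -> tuple[int, int]:
    """Return (DIV, REM) for unsigned 32-bit operands, following Table 7."""
    if rs == 0:
        return 0xFFFFFFFF, 0
    return rt // rs, rt % rs
```

DIVL and DIVLU correspond to the first element of the returned pair, REML and REMLU to the second, and DIVREML and DIVREMLU to both packed as {REM, DIV}.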

Table 8.

| Command mnemonic | Formula | Description |
| --- | --- | --- |
| Scalar operations for calculating transcendental mathematical functions | | |
| FEXP2 Rt, Rd | Rd.L = 2.0^Rt | Base-2 exponential: Z = 2^X. Input (X) and output (Z) are in 32-bit floating-point format: (float32)X, (float32)Z |
| FLOG Rt, Rd | Rd.L = ln(Rt) | Natural logarithm: Z = ln X. Input (X) and output (Z) are in 32-bit floating-point format: (float32)X, (float32)Z |
| FSQRT Rt, Rd | Rd.L = sqrt_approx(Rt) | Square root: Z = √X. Input (X) and output (Z) are in 32-bit floating-point format: (float32)X, (float32)Z. Restrictions and special cases: FSQRT(0) = 0; FSQRT(-0) = -0; FSQRT(A) = NaN for A < 0 |
| FISQRT Rt, Rd | Rd.L = isqrt_approx(Rt) | Reciprocal of the square root: Z = 1/√X. Input (X) and output (Z) are in 32-bit floating-point format: (float32)X, (float32)Z |
| FRECIP Rt, Rd | Rd.L = recip_approx(Rt) | Reciprocal: Z = 1/X. Input (X) and output (Z) are in 32-bit floating-point format: (float32)X, (float32)Z |
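The special cases listed for FSQRT can be expressed as a small Python reference model. This is only a sketch: the names `to_f32` and `fsqrt_ref` are hypothetical, the result is a correctly rounded square root rather than the approximation (sqrt_approx) that the table names, and inputs are assumed to fit the float32 range.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fsqrt_ref(x: float) -> float:
    """Reference model of the FSQRT special cases from Table 8."""
    x = to_f32(x)
    if math.isnan(x) or x < 0.0:
        return math.nan          # FSQRT(A) = NaN for A < 0
    if x == 0.0:
        return x                 # FSQRT(0) = 0 and FSQRT(-0) = -0 (sign kept)
    return to_f32(math.sqrt(x))
```

Note that -0.0 compares equal to 0.0, so the zero branch returns the input unchanged and thereby preserves its sign.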

Table 9.

| Command mnemonic | Description |
| --- | --- |
| VLD | Load without data type conversion |
| VLDBH; VLDBHU | Load with sign or zero extension of the data type, i8(u8) → i16(u16) |
| VLDBL; VLDBLU | Load with sign or zero extension of the data type, i8(u8) → i32(u32) |
| VLDHL; VLDHLU | Load with sign or zero extension of the data type, i16(u16) → i32(u32) |
| VLDLD; VLDLDU | Load with sign or zero extension of the data type, i32(u32) → i64(u64) |
| VST | Store without data type conversion |
| VSTHBU | Store with data type truncation, u16 → u8 |
| VSTLBU | Store with data type truncation, u32 → u8 |
| VSTLHU | Store with data type truncation, u32 → u16 |
| VSTDLU | Store with data type truncation, u64 → u32 |
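The widening loads and narrowing stores of Table 9 can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes that "truncation" means keeping the low-order bits of each element; the table itself does not say whether the store truncates or clamps.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def load_extend(raw: bytes, signed: bool) -> list[int]:
    """Model of VLDBH (signed=True) or VLDBHU (signed=False): i8/u8 -> i16/u16."""
    return [sign_extend(b, 8) if signed else b for b in raw]

def store_truncate(lanes: list[int], bits: int) -> list[int]:
    """Model of VSTHBU-style stores; assumption: only the low `bits` bits are kept."""
    mask = (1 << bits) - 1
    return [lane & mask for lane in lanes]
```

The byte 0xFF thus loads as -1 with the signed form and as 255 with the unsigned form.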

Table 10.

| Command mnemonic | Formula | Description |
| --- | --- | --- |
| Vector integer addition/subtraction operations | | |
| VADDD Vt, Vs, Vd; VADDD.SAT Vt, Vs, Vd; VADDDU.SAT Vt, Vs, Vd | D[i] = T[i] + S[i], i = 0; D[i] = sat64(T[i] + S[i]), i = 0; D[i] = usat64(T[i] + S[i]), i = 0, unsigned | Addition, i64 = i64 + i64 signed (unsigned: u64 + u64 → u64). Optional saturation |
| VADDD.SCL Vt, Vs, Vd; VADDD.SCL.RND Vt, Vs, Vd; VADDDU.SCL Vt, Vs, Vd; VADDDU.SCL.RND Vt, Vs, Vd | D[i] = (T[i] + S[i]) >> 1, i = 0; D[i] = rnd(T[i] + S[i]) >> 1, i = 0; D[i] = (T[i] + S[i]) >> 1, i = 0, unsigned; D[i] = rnd(T[i] + S[i]) >> 1, i = 0, unsigned | Addition, i64 = i64 + i64 signed (unsigned: u64 + u64 → u64). Optional rounding |
| VADDL Vt, Vs, Vd; VADDL #32, Vs, Vd; VADDL.SAT Vt, Vs, Vd; VADDLU.SAT Vt, Vs, Vd | D[i] = T[i] + S[i], i = 0:1; D[i] = #32 + S[i], i = 0:1; D[i] = sat32(T[i] + S[i]), i = 0:1; D[i] = usat32(T[i] + S[i]), i = 0:1, unsigned | Addition, i32 = i32 + i32 signed (unsigned: u32 + u32 → u32). Optional saturation |
| VADDL.SCL Vt, Vs, Vd; VADDL.SCL.RND Vt, Vs, Vd; VADDLU.SCL Vt, Vs, Vd; VADDLU.SCL.RND Vt, Vs, Vd | D[i] = (T[i] + S[i]) >> 1, i = 0:1; D[i] = rnd(T[i] + S[i]) >> 1, i = 0:1; D[i] = (T[i] + S[i]) >> 1, i = 0:1, unsigned; D[i] = rnd(T[i] + S[i]) >> 1, i = 0:1, unsigned | Addition, i32 = i32 + i32 signed (unsigned: u32 + u32 → u32). Optional rounding |
| VADDH Vt, Vs, Vd; VADDH #IMM16, Vs, Vd; VADDH.SAT Vt, Vs, Vd; VADDHU.SAT Vt, Vs, Vd | D[i] = T[i] + S[i], i = 0:3; D[i] = #IMM16 + S[i], i = 0:3; D[i] = sat16(T[i] + S[i]), i = 0:3; D[i] = usat16(T[i] + S[i]), i = 0:3, unsigned | Addition, i16 = i16 + i16 signed (unsigned: u16 + u16 → u16). Optional saturation |
| VADDH.SCL Vt, Vs, Vd; VADDH.SCL.RND Vt, Vs, Vd; VADDHU.SCL Vt, Vs, Vd; VADDHU.SCL.RND Vt, Vs, Vd | D[i] = (T[i] + S[i]) >> 1, i = 0:3; D[i] = rnd(T[i] + S[i]) >> 1, i = 0:3; D[i] = (T[i] + S[i]) >> 1, i = 0:3, unsigned; D[i] = rnd(T[i] + S[i]) >> 1, i = 0:3, unsigned | Addition, i16 = i16 + i16 signed (unsigned: u16 + u16 → u16). Optional rounding |
| VADDB Vt, Vs, Vd; VADDB #IMM8, Vs, Vd; VADDB.SAT Vt, Vs, Vd; VADDBU.SAT Vt, Vs, Vd | D[i] = T[i] + S[i], i = 0:7; D[i] = #IMM8 + S[i], i = 0:7; D[i] = sat8(T[i] + S[i]), i = 0:7; D[i] = usat8(T[i] + S[i]), i = 0:7, unsigned | Addition, i8 = i8 + i8 signed (unsigned: u8 + u8 → u8). Optional saturation |
| VADDB.SCL Vt, Vs, Vd; VADDB.SCL.RND Vt, Vs, Vd; VADDBU.SCL Vt, Vs, Vd; VADDBU.SCL.RND Vt, Vs, Vd | D[i] = (T[i] + S[i]) >> 1, i = 0:7; D[i] = rnd(T[i] + S[i]) >> 1, i = 0:7; D[i] = (T[i] + S[i]) >> 1, i = 0:7, unsigned; D[i] = rnd(T[i] + S[i]) >> 1, i = 0:7, unsigned | Addition, i8 = i8 + i8 signed (unsigned: u8 + u8 → u8). Optional rounding |
| VSUBD Vt, Vs, Vd; VSUBD.SAT Vt, Vs, Vd; VSUBDU.SAT Vt, Vs, Vd | D[i] = S[i] - T[i], i = 0; D[i] = sat64(S[i] - T[i]), i = 0; D[i] = usat64(S[i] - T[i]), i = 0, unsigned; Vt = {T1, T0}, i64/u64; Vs = {S0, S0}, i64/u64; Vd = {D1, D0}, i64/u64 | Subtraction, i64 = i64 - i64 signed (unsigned: u64 - u64 → u64). Optional saturation |
| VSUBL Vt, Vs, Vd; VSUBL #32, Vs, Vd; VSUBL.SAT Vt, Vs, Vd; VSUBLU.SAT Vt, Vs, Vd | D[i] = S[i] - T[i], i = 0:1; D[i] = S[i] - #32, i = 0:1; D[i] = sat32(S[i] - T[i]), i = 0:1; D[i] = usat32(S[i] - T[i]), i = 0:1, unsigned | Subtraction, i32 = i32 - i32 signed (unsigned: u32 - u32 → u32). Optional saturation |
| VSUBL.SCL Vt, Vs, Vd; VSUBL.SCL.RND Vt, Vs, Vd; VSUBL.SCL.RND.SAT Vt, Vs, Vd | D[i] = (S[i] - T[i]) >> 1, i = 0:1; D[i] = rnd(S[i] - T[i] + 1) >> 1, i = 0:1; D[i] = sat32(rnd(S[i] - T[i]) >> 1), i = 0:1 | Subtraction, i32 = i32 - i32 signed. Optional rounding and saturation |
| VSUBH Vt, Vs, Vd; VSUBH #IMM16, Vs, Vd; VSUBH.SAT Vt, Vs, Vd; VSUBHU.SAT Vt, Vs, Vd | D[i] = S[i] - T[i], i = 0:3; D[i] = S[i] - #IMM16, i = 0:3; D[i] = sat16(S[i] - T[i]), i = 0:3; D[i] = usat16(S[i] - T[i]), i = 0:3, unsigned | Subtraction, i16 = i16 - i16 signed (unsigned: u16 - u16 → u16). Optional saturation |
| VSUBH.SCL Vt, Vs, Vd; VSUBH.SCL.RND Vt, Vs, Vd; VSUBH.SCL.RND.SAT Vt, Vs, Vd | D[i] = (S[i] - T[i]) >> 1, i = 0:3; D[i] = rnd(S[i] - T[i]) >> 1, i = 0:3; D[i] = sat16(rnd(S[i] - T[i]) >> 1), i = 0:3 | Subtraction, i16 = i16 - i16 signed. Optional rounding and saturation |
| VSUBB Vt, Vs, Vd; VSUBB #IMM8, Vs, Vd; VSUBB.SAT Vt, Vs, Vd; VSUBBU.SAT Vt, Vs, Vd | D[i] = S[i] - T[i], i = 0:7; D[i] = S[i] - #IMM8, i = 0:7; D[i] = sat8(S[i] - T[i]), i = 0:7; D[i] = usat8(S[i] - T[i]), i = 0:7, unsigned | Subtraction, i8 = i8 - i8 signed (unsigned: u8 - u8 → u8). Optional saturation |
| VSUBB.SCL Vt, Vs, Vd; VSUBB.SCL.RND Vt, Vs, Vd; VSUBB.SCL.RND.SAT Vt, Vs, Vd | D[i] = (S[i] - T[i]) >> 1, i = 0:7; D[i] = rnd(S[i] - T[i]) >> 1, i = 0:7; D[i] = sat8(rnd(S[i] - T[i]) >> 1), i = 0:7 | Subtraction, i8 = i8 - i8 signed. Optional rounding and saturation |
| Vector absolute value operations | | |
| VABSD Vs, Vd; VABSD.SAT Vs, Vd | D[i] = abs(S[i]), i = 0; D[i] = sat64(abs(S[i])), i = 0 | Absolute value of an integer, i64, with optional saturation |
| VABSL Vs, Vd; VABSL.SAT Vs, Vd | D[i] = abs(S[i]), i = 0:1; D[i] = sat32(abs(S[i])), i = 0:1 | Absolute value of an integer, i32, with optional saturation |
| VABSH Vs, Vd; VABSH.SAT Vs, Vd | D[i] = abs(S[i]), i = 0:3; D[i] = sat16(abs(S[i])), i = 0:3 | Absolute value of an integer, i16, with optional saturation |
| VABSB Vs, Vd; VABSB.SAT Vs, Vd | D[i] = abs(S[i]), i = 0:7; D[i] = sat8(abs(S[i])), i = 0:7 | Absolute value of an integer, i8, with optional saturation |
| Vector integer maximum/minimum operations | | |
| VMAXD Vt, Vs, Vd; VMAXDU Vt, Vs, Vd | D[i] = max(T[i], S[i]), i = 0; D[i] = umax(T[i], S[i]), i = 0 | Element-wise maximum, i64, u64 |
| VMAXL Vt, Vs, Vd; VMAXLU Vt, Vs, Vd | D[i] = max(T[i], S[i]), i = 0:1; D[i] = umax(T[i], S[i]), i = 0:1 | Element-wise maximum, i32, u32 |
| VMAXH Vt, Vs, Vd; VMAXHU Vt, Vs, Vd | D[i] = max(T[i], S[i]), i = 0:3; D[i] = umax(T[i], S[i]), i = 0:3 | Element-wise maximum, i16, u16 |
| VMAXB Vt, Vs, Vd; VMAXBU Vt, Vs, Vd | D[i] = max(T[i], S[i]), i = 0:7; D[i] = umax(T[i], S[i]), i = 0:7 | Element-wise maximum, i8, u8 |
| VMIND Vt, Vs, Vd; VMINDU Vt, Vs, Vd | D[i] = min(T[i], S[i]), i = 0; D[i] = umin(T[i], S[i]), i = 0 | Element-wise minimum, i64, u64 |
| VMINL Vt, Vs, Vd; VMINLU Vt, Vs, Vd | D[i] = min(T[i], S[i]), i = 0:1; D[i] = umin(T[i], S[i]), i = 0:1 | Element-wise minimum, i32, u32 |
| VMINH Vt, Vs, Vd; VMINHU Vt, Vs, Vd | D[i] = min(T[i], S[i]), i = 0:3; D[i] = umin(T[i], S[i]), i = 0:3 | Element-wise minimum, i16, u16 |
| VMINB Vt, Vs, Vd; VMINBU Vt, Vs, Vd | D[i] = min(T[i], S[i]), i = 0:7; D[i] = umin(T[i], S[i]), i = 0:7 | Element-wise minimum, i8, u8 |
| VMAX2L Vt, Vs, Vd; VMAX2LU Vt, Vs, Vd | Signed: D[0] = T[0]; then D[i] = max(D[0], S[i]), i = 0, …, n-1. Unsigned: D[0] = T[0]; then D[i] = maxu(D[0], S[i]), i = 0, …, n-1 | Search for the running maximum, i32/u32, n = 2 |
| VMAX4H Vt, Vs, Vd; VMAX4HU Vt, Vs, Vd | Signed: D[0] = T[0]; then D[i] = max(D[0], S[i]), i = 0, …, n-1. Unsigned: D[0] = T[0]; then D[i] = maxu(D[0], S[i]), i = 0, …, n-1 | Search for the running maximum, i16/u16, n = 4 |
| VMAX8B Vt, Vs, Vd; VMAX8BU Vt, Vs, Vd | Signed: D[0] = T[0]; then D[i] = max(D[0], S[i]), i = 0, …, n-1. Unsigned: D[0] = T[0]; then D[i] = maxu(D[0], S[i]), i = 0, …, n-1 | Search for the running maximum, i8/u8, n = 8 |
| VMIN2L Vt, Vs, Vd; VMIN2LU Vt, Vs, Vd | Signed: D[0] = T[0]; then D[i] = min(D[0], S[i]), i = 0, …, n-1. Unsigned: D[0] = T[0]; then D[i] = minu(D[0], S[i]), i = 0, …, n-1 | Search for the running minimum, i32/u32, n = 2 |
| VMIN4H Vt, Vs, Vd; VMIN4HU Vt, Vs, Vd | Signed: D[0] = T[0]; then D[i] = min(D[0], S[i]), i = 0, …, n-1. Unsigned: D[0] = T[0]; then D[i] = minu(D[0], S[i]), i = 0, …, n-1 | Search for the running minimum, i16/u16, n = 4 |
| VMIN8B Vt, Vs, Vd; VMIN8BU Vt, Vs, Vd | Signed: D[0] = T[0]; then D[i] = min(D[0], S[i]), i = 0, …, n-1. Unsigned: D[0] = T[0]; then D[i] = minu(D[0], S[i]), i = 0, …, n-1 | Search for the running minimum, i8/u8, n = 8 |
| Vector logical operations | | |
| VAND Vt, Vs, Vd | D = T & S | Element-wise logical AND, i8 |
| VANDC Vt, Vs, Vd; VANDI Vt, Vs, Vd | D = ~T & S; D = ~(T & S) | Element-wise logical AND with inversion of one of the operands or of the result, i8 |
| VOR Vt, Vs, Vd | D = T \| S | Element-wise logical OR, i8 |
| VORC Vt, Vs, Vd; VORI Vt, Vs, Vd | D = ~T \| S; D = ~(T \| S) | Element-wise logical OR with inversion of one of the operands or of the result, i8 |
| VEOR Vt, Vs, Vd | D = T ^ S | Element-wise logical exclusive OR, i8 |
| VNOT Vs, Vd | D = ~S | Element-wise negation of the result, i8 |
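The saturating (.SAT) and scaled (.SCL) additions of Table 10 can be illustrated per lane with a minimal Python sketch. The function names are hypothetical. The sketch assumes that the intermediate sum is held at full precision before saturation or shifting, and that rnd adds 1 before the right shift; the table does not define rnd, so that choice is an assumption.

```python
def sat(value: int, bits: int) -> int:
    """Clamp to the signed two's-complement range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def usat(value: int, bits: int) -> int:
    """Clamp to the unsigned range of the given width."""
    return max(0, min((1 << bits) - 1, value))

def vadd_sat(t: list[int], s: list[int], bits: int, unsigned: bool = False) -> list[int]:
    """Lane-wise D[i] = sat(T[i] + S[i]) or usat(T[i] + S[i])."""
    clamp = usat if unsigned else sat
    return [clamp(a + b, bits) for a, b in zip(t, s)]

def vadd_scl(t: list[int], s: list[int], rnd: bool = False) -> list[int]:
    """Lane-wise D[i] = (T[i] + S[i]) >> 1, optionally rounded (assumed: +1 first)."""
    return [(a + b + (1 if rnd else 0)) >> 1 for a, b in zip(t, s)]
```

With 8-bit lanes, 100 + 100 saturates to 127 in the signed form, whereas the scaled form halves the full-precision sum and therefore cannot overflow the lane.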

Table 11.

| Command mnemonic | Formula | Description |
| --- | --- | --- |
| Vector floating-point addition/subtraction operations | | |
| VHADD Vt, Vs, Vd | VT = {T3, T2, T1, T0}; VS = {S3, S2, S1, S0}; VD = {S3 + T2, S2 + T3, S1 + T0, S0 + T1} | Addition of two numbers, f16 + f16 → f16 |
| VFADD Vt, Vs, Vd | VT = {T1, T0}; VS = {S1, S0}; VD = {S1 + T1, S0 + T0} | Addition of two numbers, f32 + f32 → f32 |
| VDADD Vt, Vs, Vd | VT = {T0}; VS = {S0}; VD = {S0 + T0} | Addition of two numbers, f64 + f64 → f64 |
| VHSUB Vt, Vs, Vd | VT = {T3, T2, T1, T0}; VS = {S3, S2, S1, S0}; VD = {S3 - T2, S2 - T3, S1 - T0, S0 - T1} | Subtraction of two numbers, f16 - f16 → f16 |
| VFSUB Vt, Vs, Vd | VT = {T1, T0}; VS = {S1, S0}; VD = {S1 - T1, S0 - T0} | Subtraction of two numbers, f32 - f32 → f32 |
| VDSUB Vt, Vs, Vd | VT = {T0}; VS = {S0}; VD = {S0 - T0} | Subtraction of two numbers, f64 - f64 → f64 |
| Vector floating-point maximum/minimum operations | | |
| VHMAX Vt, Vs, Vd | Vt = {T[i]}, Vs = {S[i]}, Vd = {D[i]}, i = 0…3; D[i] = max(T[i], S[i]) | Maximum of two numbers, f16 |
| VFMAX Vt, Vs, Vd | Vt = {T[i]}, Vs = {S[i]}, Vd = {D[i]}, i = 0…1; D[i] = max(T[i], S[i]) | Maximum of two numbers, f32 |
| VDMAX Vt, Vs, Vd | Vt = {T[i]}, Vs = {S[i]}, Vd = {D[i]}, i = 0; D[i] = max(T[i], S[i]) | Maximum of two numbers, f64 |
| VHMIN Vt, Vs, Vd | Vt = {T[i]}, Vs = {S[i]}, Vd = {D[i]}, f16, i = 0…3; D[i] = min(T[i], S[i]) | Minimum of two numbers, f16 |
| VFMIN Vt, Vs, Vd | Vt = {T[i]}, Vs = {S[i]}, Vd = {D[i]}, i = 0…1; D[i] = min(T[i], S[i]) | Minimum of two numbers, f32 |
| VDMIN Vt, Vs, Vd | Vt = {T[i]}, Vs = {S[i]}, Vd = {D[i]}, i = 0; D[i] = min(T[i], S[i]) | Minimum of two numbers, f64 |
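As Table 11 lists it, VHADD pairs each S lane with the neighbouring T lane (S0 with T1, S1 with T0, and so on), whereas VFADD adds lanes with equal indices. The Python sketch below only reproduces that lane pairing as written in the table; the function names are hypothetical, lists are indexed by lane number, and f16 rounding is not modelled.

```python
def vhadd(vt: list[float], vs: list[float]) -> list[float]:
    """Lane pairing of VHADD as listed: D[i] = S[i] + T[i ^ 1], i = 0..3."""
    return [vs[i] + vt[i ^ 1] for i in range(4)]

def vfadd(vt: list[float], vs: list[float]) -> list[float]:
    """Lane pairing of VFADD as listed: D[i] = S[i] + T[i], i = 0..1."""
    return [s + t for t, s in zip(vt, vs)]
```

The expression i ^ 1 swaps the two lanes inside each pair, which matches VD = {S3 + T2, S2 + T3, S1 + T0, S0 + T1}.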

Табл. 12.Tab. 12.

Мнемоника командыCommand mnemonic ФормулаFormula ОписаниеDescription Векторные операции целочисленного умноженияVector operations of integer multiplication VMPYB T, S, DVMPYB T, S, D T = {8{i8}} = {T7,…,T0};
S = {8{i8}} = {S7,…,S0};
D = {8{i16}} = {D7,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:7;T = {8{i8}} = {T7,…,T0};
S = {8{i8}} = {S7,…,S0};
D = {8{i16}} = {D7,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:7; Умножение, целое, со знаком.
[i16 = i8 ⋅ i8]Multiplication, integer, signed.
[i16 = i8 ⋅ i8] VMPYBU T, S, DVMPYBU T, S, D T = {8{u8}} = {T7,…,T0};
S = {8{u8}} = {S7,…,S0};
D = {8{u16}} = {D7,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:7;T = {8{u8}} = {T7,…,T0};
S = {8{u8}} = {S7,…,S0};
D = {8{u16}} = {D7,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:7; Умножение, целое, без знака.
[u16 = u8 ⋅ u8]Multiplication, integer, unsigned.
[u16 = u8 ⋅ u8] VMPYH T, S, DVMPYH T, S, D T = {4{i16}} = {T3,…,T0};
S = {4{i16}} = {S3,…,S0};
D = {4{i32}} = {D3,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:3;T = {4{i16}} = {T3,…,T0};
S = {4{i16}} = {S3,…,S0};
D = {4{i32}} = {D3,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:3; Умножение, целое, со знаком.
[i32 = i16 ⋅ i16]Multiplication, integer, signed.
[i32 = i16 ⋅ i16] VMPYHU T, S, DVMPYHU T, S, D T = {4{u16}} = {T3,…,T0};
S = {4{u16}} = {S3,…,S0};
D = {4{u32}} = {D3,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:3;T = {4{u16}} = {T3,…,T0};
S = {4{u16}} = {S3,…,S0};
D = {4{u32}} = {D3,…,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0:3; Умножение, целое, без знака.
[u32 = u16 ⋅ u16]Multiplication, integer, unsigned.
[u32 = u16 ⋅ u16] VMPYL T, S, DVMPYL T, S, D T = {2{i32}} = {T1,T0};
S = {2{i32}} = {S1,S0};
D = {2{i64}} = {D1,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0,1;T = {2{i32}} = {T1,T0};
S = {2{i32}} = {S1,S0};
D = {2{i64}} = {D1,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0.1; Умножение, целое, со знаком.
[i64 = i32 ⋅ i32]Multiplication, integer, signed.
[i64 = i32 ⋅ i32] VMPYLU T, S, DVMPYLU T, S, D T = {2{u32}} = {T1,T0};
S = {2{u32}} = {S1,S0};
D = {2{u64}} = {D1,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0,1;T = {2{u32}} = {T1,T0};
S = {2{u32}} = {S1,S0};
D = {2{u64}} = {D1,D0} = {D1,D0};
Di = Ti ⋅ Si; i = 0.1; Умножение, целое, без знака.
[u64 = u32 ⋅ u32]Multiplication, integer, unsigned.
[u64 = u32 ⋅ u32] VMPYD T, S, DVMPYD T, S, D T = {i64};
S = {i64};
D = {i128} = {D1,D0};
D = T ⋅ S;T = {i64};
S = {i64};
D = {i128} = {D1,D0};
D = T ⋅ S; Умножение, целое, со знаком.
[i128 = i64 ⋅ i64]Multiplication, integer, signed.
[i128 = i64 ⋅ i64] VMPYDU T, S, DVMPYDU T, S, D T = {u64};
S = {u64};
D = {u128} = {D1,D0};
D = T ⋅ S;T = {u64};
S = {u64};
D = {u128} = {D1,D0};
D = T ⋅ S; Умножение, целое, без знака.
[u128 = u64 ⋅ u64]Multiplication, integer, unsigned.
[u128 = u64 ⋅ u64] Векторные операции целочисленного умножения с накоплением в аккумулятореVector operations of integer multiplication with accumulation in the accumulator VMPACBB T, S
VMPACBB T, S, VAdVMPACBB T, S
VMPACBB T, S, VAd T = {8{i8}} = {T7,…,T0};
S = {8{i8}} = {S7,…,S0};
AC = {8{i32}} = {AC7,…,AC0};
ACi += Ti ⋅ Si; i = 0:7;T = {8{i8}} = {T7,…,T0};
S = {8{i8}} = {S7,…,S0};
AC = {8{i32}} = {AC7,…,AC0};
ACi += Ti ⋅ Si; i = 0:7; Умножение с накоплением в аккумуляторе, целое, со знаком.
В двухадресной форме неявно используется VA0.
[i32 += i8 ⋅ i8]Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i32 += i8 ⋅ i8] VMPACBB T, S
VMPACBB T, S, VAdVMPACBB T, S
VMPACBB T, S, VAd T = {8{u8}} = {T7,…,T0};
S = {8{i8}} = {S7,…,S0};
AC = {8{i32}} = {AC7,…,AC0};
ACi += Ti ⋅ Si; i = 0:7;T = {8{u8}} = {T7,…,T0};
S = {8{i8}} = {S7,…,S0};
AC = {8{i32}} = {AC7,…,AC0};
ACi += Ti ⋅ Si; i = 0:7; Умножение с накоплением в аккумуляторе, целое, со знаком.
В двухадресной форме неявно используется VA0.
[i32 += u8 ⋅ i8]Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i32 += u8 ⋅ i8] VMPACBH T, S
VMPACBH T, S, VAdVMPACBH T, S
VMPACBH T, S, VAd T = {4{0, i8}} = {0,T3,…,0,T0};
S = {4{i16}} = {S3,…,S0};
AC = {4{i64}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0:3;T = {4{0, i8}} = {0,T3,…,0,T0};
S = {4{i16}} = {S3,…,S0};
AC = {4{i64}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0:3;
Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i64 += i8 ⋅ i16]

VMPACBUH T, S
VMPACBUH T, S, VAd
T = {4{0,u8}} = {0,T3,…,0,T0};
S = {4{i16}} = {S3,…,S0};
AC = {4{i64}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0:3;
Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i64 += u8 ⋅ i16]

VMPACHH T, S
VMPACHH T, S, VAd
T = {4{i16}} = {T3,…,T0};
S = {4{i16}} = {S3,…,S0};
AC = {4{i64}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0:3;
Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i64 += i16 ⋅ i16]

VMPACHUH T, S
VMPACHUH T, S, VAd
T = {4{u16}} = {T3,…,T0};
S = {4{i16}} = {S3,…,S0};
AC = {4{i64}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0:3;
Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i64 += u16 ⋅ i16]

VMPACLL T, S
VMPACLL T, S, VAd
T = {2{i32}} = {T1,T0};
S = {2{i32}} = {S1,S0};
AC = {4{i64}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0,2;
Multiplication with accumulation in the accumulator, integer, signed.
The two-address form implicitly uses VA0.
[i64 += i32 ⋅ i32]

Vector floating-point multiplication operations

VHMPY T, S, D
T = {4{f16}} = {T3,…,T0};
S = {4{f16}} = {S3,…,S0};
D = {4{f16}} = {D3,…,D0};
Di = Ti ⋅ Si; i = 0:3;
Multiplication, half-precision floating-point numbers.
[f16 = f16 ⋅ f16]

VFMPY T, S, D
T = {2{f32}} = {T1,T0};
S = {2{f32}} = {S1,S0};
D = {2{f32}} = {D1,D0};
Di = Ti ⋅ Si; i = 0,1;
Multiplication, single-precision floating-point numbers.
[f32 = f32 ⋅ f32]

VDMPY T, S, D
T = {f64};
S = {f64};
D = {f64};
D = T ⋅ S;
Multiplication, double-precision floating-point numbers.
[f64 = f64 ⋅ f64]

Vector floating-point multiplication operations with accumulation in the accumulator

VHMPAC T, S
VHMPAC T, S, VAd
T = {4{f16}} = {T3,…,T0};
S = {4{f16}} = {S3,…,S0};
AC = {4{f32}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0:3;
[f32 += f32(f16 ⋅ f16)]
Multiplication with accumulation, half-precision floating-point numbers. The multiplication and the addition of products are performed at f16 width; the result is then widened to f32 and accumulated in the f32 accumulator.
The two-address form implicitly uses VA0.

VFMPAC T, S
VFMPAC T, S, VAd
T = {2{f32}} = {T1,T0};
S = {2{f32}} = {S1,S0};
AC = {2{f32}} = {AC1,AC0};
ACi += Ti ⋅ Si; i = 0,1;
Multiplication with accumulation, single-precision floating-point numbers.
The two-address form implicitly uses VA0.
[f32 += f32 ⋅ f32]

VFMPAC4 T, S
VFMPAC4 T, S, VAd
T = {T', T} = {4{f32}} = {T3,T2,T1,T0};
S = {S', S} = {4{f32}} = {S3,S2,S1,S0};
AC = {4{f32}} = {AC3,…,AC0};
ACi += Ti ⋅ Si; i = 0,3;
Multiplication with accumulation, single-precision floating-point numbers.
The two-address form implicitly uses VA0.
[f32 += f32 ⋅ f32]
Adjacent input registers are used: R'[i] = R[i^1].

VDMPAC T, S
VDMPAC T, S, VAd
T = {1{f64}} = {T0};
S = {1{f64}} = {S0};
AC = {1{f64}} = {AC0};
ACi += Ti ⋅ Si; i = 0;
Multiplication with accumulation, double-precision floating-point numbers.
The two-address form implicitly uses VA0.
[f64 += f64 ⋅ f64]

VDMPAC2 T, S
VDMPAC2 T, S, VAd
T = {T', T} = {2{f64}} = {T1, T0};
S = {S', S} = {2{f64}} = {S1, S0};
AC = {2{f64}} = {AC1,AC0};
ACi += Ti ⋅ Si; i = 0…1;
Multiplication with accumulation, double-precision floating-point numbers.
The two-address form implicitly uses VA0.
[f64 += f64 ⋅ f64]
Adjacent input registers are used: R'[i] = R[i^1].
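
The lane-wise multiply-accumulate formula ACi += Ti ⋅ Si used by the integer rows above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the patent text: the function names `wrap_i64`, `vmpac` and `paired_register` are hypothetical, list index 0 stands for element 0, and overflow of the i64 accumulator is modelled as two's-complement wraparound because the table does not say whether the hardware wraps or saturates.

```python
MASK64 = (1 << 64) - 1


def wrap_i64(value):
    """Wrap a Python int to a signed 64-bit value (assumed overflow behaviour)."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


def vmpac(acc, t, s):
    """Lane-wise multiply-accumulate: acc[i] += t[i] * s[i] for every lane i."""
    if not (len(acc) == len(t) == len(s)):
        raise ValueError("all operands must have the same number of lanes")
    return [wrap_i64(a + ti * si) for a, ti, si in zip(acc, t, s)]


def paired_register(index):
    """Index of the adjacent register R' for register R, i.e. R'[i] = R[i^1]."""
    return index ^ 1


# VMPACBUH-style call: T lanes are zero-extended u8 values, S lanes are i16.
acc = vmpac([0, 0, 0, 0], [1, 2, 3, 255], [10, -20, 30, -1])
print(acc)
```

The `paired_register` helper only restates the R'[i] = R[i^1] note: registers 4 and 5, for example, form one adjacent pair.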

Table 13.

Command mnemonic / Formula / Description

Vector shift operations

VASRD Vt(#u5), Vs, Vd
D[i] = S[i] >> T[i], i=0
D[i] = S[i] >> #IMM5, i=0
Arithmetic right shift, i64

VLSLD Vt(#u5), Vs, Vd
D[i] = S[i] << T[i], i=0
D[i] = S[i] << #IMM5, i=0
Left shift, i64

VLSRD Vt(#u5), Vs, Vd
D[i] = S[i] >>> T[i], i=0
D[i] = S[i] >>> #IMM5, i=0
Logical right shift, i64

VASRD.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >> T[i])
Arithmetic right shift with rounding, i64

VLSRD.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >>> T[i])
Logical right shift with rounding, i64

VASRL Vt(#u5), Vs, Vd
D[i] = S[i] >> T[i], i=0:1
D[i] = S[i] >> #IMM5, i=0:1
Arithmetic right shift, i32

VLSLL Vt(#u5), Vs, Vd
D[i] = S[i] << T[i], i=0:1
D[i] = S[i] << #IMM5, i=0:1
Left shift, i32

VLSRL Vt(#u5), Vs, Vd
D[i] = S[i] >>> T[i], i=0:1
D[i] = S[i] >>> #IMM5, i=0:1
Logical right shift, i32

VASRL.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >> T[i]), i=0…1
Arithmetic right shift with rounding, i32

VLSRL.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >>> T[i]), i=0…1
Logical right shift with rounding, i32

VASRH Vt(#u5), Vs, Vd
D[i] = S[i] >> T[i], i=0:3
D[i] = S[i] >> #IMM5, i=0:3
Arithmetic right shift, i16

VLSLH Vt(#u5), Vs, Vd
D[i] = S[i] << T[i], i=0:3
D[i] = S[i] << #IMM5, i=0:3
Left shift, i16

VLSRH Vt(#u5), Vs, Vd
D[i] = S[i] >>> T[i], i=0:3
D[i] = S[i] >>> #IMM5, i=0:3
Logical right shift, i16

VASRH.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >> T[i]), i=0:3
Arithmetic right shift with rounding, i16

VLSRH.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >>> T[i]), i=0:3
Logical right shift with rounding, i16

VASRB Vt(#u5), Vs, Vd
D[i] = S[i] >> T[i], i=0:7
D[i] = S[i] >> #IMM5, i=0:7
Arithmetic right shift, i8

VLSLB Vt(#u5), Vs, Vd
D[i] = S[i] << T[i], i=0:7
D[i] = S[i] << #IMM5, i=0:7
Left shift, i8

VLSRB Vt(#u5), Vs, Vd
D[i] = S[i] >>> T[i], i=0:7
D[i] = S[i] >>> #IMM5, i=0:7
Logical right shift, i8

VASRB.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >> T[i]), i=0:7
Arithmetic right shift with rounding, i8

VLSRB.RND Vt(#u5), Vs, Vd
If (T[i] == 0) D[i] = S[i]
Else D[i] = RND(S[i] >>> T[i]), i=0:7
Logical right shift with rounding, i8
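
The difference between the arithmetic (>>), logical (>>>) and rounding shift forms in Table 13 can be sketched in Python as follows. This is an illustrative model only: the helper names are hypothetical, and the table does not define RND, so the sketch assumes round-half-up implemented by adding half of the discarded weight before shifting. The T[i] == 0 case returns the source element unchanged, as in the table.

```python
def to_unsigned(value, bits):
    """Reinterpret value as an unsigned integer of the given lane width."""
    return value & ((1 << bits) - 1)


def to_signed(value, bits):
    """Reinterpret value as a two's-complement integer of the given lane width."""
    value = to_unsigned(value, bits)
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def asr(s, n, bits):
    """Arithmetic right shift: the sign bit is replicated."""
    return to_signed(s, bits) >> n


def lsr(s, n, bits):
    """Logical right shift: zeros are shifted in from the top."""
    return to_unsigned(s, bits) >> n


def lsl(s, n, bits):
    """Left shift, with the result truncated to the lane width."""
    return to_signed(s << n, bits)


def asr_rnd(s, n, bits):
    """Arithmetic right shift with rounding (round-half-up is an assumption)."""
    if n == 0:
        return to_signed(s, bits)
    return (to_signed(s, bits) + (1 << (n - 1))) >> n


def shift_lanes(op, s_lanes, counts, bits):
    """Apply a per-lane shift: D[i] = op(S[i], T[i])."""
    return [op(s, n, bits) for s, n in zip(s_lanes, counts)]


# Four i16 lanes, as in VASRH, each shifted by its own count.
result = shift_lanes(asr, [-8, 8, -1, 1], [1, 2, 3, 0], 16)
print(result)
```

Passing the same count in every lane models the #IMM5 immediate form.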

Table 14.

Command mnemonic / Formula / Description

Vector integer-type extension operations

VCVBHEU Vs, Vd
VCVBHOU Vs, Vd
VCVBHE Vs, Vd
VCVBHO Vs, Vd
Vs = {S7…S0}, i8
Vd = {D3…D0}, i16
D[i] = zext_8→16(S[2i+0]), i=0…3
D[i] = zext_8→16(S[2i+1]), i=0…3
D[i] = sext_8→16(S[2i+0]), i=0…3
D[i] = sext_8→16(S[2i+1]), i=0…3
Sign extension i8→i16 or zero extension u8→u16. The even or odd elements of the first operand are used as the source.

VCVHLEU Vs, Vd
VCVHLOU Vs, Vd
VCVHLE Vs, Vd
VCVHLO Vs, Vd
Vs = {S3…S0}, i16
Vd = {D1…D0}, i32
D[i] = zext_16→32(S[2i+0]), i=0…1
D[i] = zext_16→32(S[2i+1]), i=0…1
D[i] = sext_16→32(S[2i+0]), i=0…1
D[i] = sext_16→32(S[2i+1]), i=0…1
Sign extension i16→i32 or zero extension u16→u32. The even or odd elements of the first operand are used as the source.

VCVLDEU Vs, Vd
VCVLDOU Vs, Vd
VCVLDE Vs, Vd
VCVLDO Vs, Vd
Vs = {S1…S0}, i32
Vd = {D0}, i64
D[i] = zext_32→64(S[2i+0]), i=0…1
D[i] = zext_32→64(S[2i+1]), i=0…1
D[i] = sext_32→64(S[2i+0]), i=0…1
D[i] = sext_32→64(S[2i+1]), i=0…1
Sign extension i32→i64 or zero extension u32→u64. The even or odd elements of the first operand are used as the source.

Vector integer-type truncation operations

VSATDL Vt, Vs, Vd
Vt = {T0}, i64
Vs = {S0}, i64
Vd = {sat32(S0), sat32(T0)}, i32
Forced saturation, i64 → i32

VSATDLU Vt, Vs, Vd
Vt = {T0}, i64
Vs = {S0}, i64
Vd = {usat32(S0), usat32(T0)}, u32
Forced saturation, i64 → u32

VSATLH Vt, Vs, Vd
Vt = {T1, T0}, i32
Vs = {S1, S0}, i32
Vd = {sat16(S1), sat16(T1), sat16(S0), sat16(T0)}, i16
Forced saturation, i32 → i16

VSATLHU Vt, Vs, Vd
Vt = {T1, T0}, i32
Vs = {S1, S0}, i32
Vd = {usat16(S1), usat16(T1), usat16(S0), usat16(T0)}, u16
Forced saturation, i32 → u16

VSATHB Vt, Vs, Vd
Vt = {T3, T2, T1, T0}, i16
Vs = {S3, S2, S1, S0}, i16
Vd = {sat8(S3), sat8(T3), …, sat8(S0), sat8(T0)}, i8
Forced saturation, i16 → i8

VSATHBU Vt, Vs, Vd
Vt = {T3, T2, T1, T0}, i16
Vs = {S3, S2, S1, S0}, i16
Vd = {usat8(S3), usat8(T3), …, usat8(S0), usat8(T0)}, u8
Forced saturation, i16 → u8

Vector floating-point type conversion operations

VFDCV Vt, VVd
Vt = {T1, T0}, f32
VVd = {Vd', Vd} = {{D1}, {D0}}, f64
Di = Ti, i = 0…1
Conversion from float32 to float64

VDFCV Vt, Vs, Vd
Vt = {T0}, f64
Vs = {S0}, f64
Vd = {D1, D0}, f32
D1 = S0, D0 = T0
Conversion from float64 to float32

VHFCV Vt, VVd
Vt = {T3, T2, T1, T0}, f16
VVd = {Vd', Vd} = {{D3, D1}, {D2, D0}}, f32
Di = Ti, i = 0…1
Conversion from float16 to float32

VFHCV Vt, Vs, Vd
Vt = {T1, T0}, f64
Vs = {S1, S0}, f64
Vd = {D3, D2, D1, D0}, f32
D3 = S1, D2 = T1, D1 = S0, D0 = T0
Conversion from float32 to float16

Vector type conversion operations INT->FLT

VCVDF Vt, Vs, Vd
Vt = {T0}, i64
Vs = {S0}, i64
Vd = {D1, D0}, f32
D1 = S0, D0 = T0
Conversion from i64 to float32

VCVDFU Vt, Vs, Vd
Vt = {T0}, u64
Vs = {S0}, u64
Vd = {D1, D0}, f32
D1 = S0, D0 = T0
Conversion from u64 to float32

VCVDD Vt, Vd
Vt = {T0}, i64
Vd = {D0}, f64
D0 = T0
Conversion from i64 to float64

VCVDDU Vt, Vd
Vt = {T0}, u64
Vd = {D0}, f64
D0 = T0
Conversion from u64 to float64

VCVIF Vt, Vd
Vt = {T1, T0}, i32
Vd = {D1, D0}, f32
D1 = T1, D0 = T0
Conversion from i32 to float32

VCVIFU Vt, Vd
Vt = {T1, T0}, u32
Vd = {D1, D0}, f32
D1 = T1, D0 = T0
Conversion from u32 to float32

VCVID Vt, VVd
Vt = {T1, T0}, i32
Vd = {Vd', Vd} = {{D1}, {D0}}, f64
D1 = T1, D0 = T0
Conversion from i32 to float64

VCVIDU Vt, VVd
Vt = {T1, T0}, u32
Vd = {Vd', Vd} = {{D1}, {D0}}, f64
D1 = T1, D0 = T0
Conversion from u32 to float64

VCVHF Vt, VVd
Vt = {T3, T2, T1, T0}, i16
Vd = {Vd', Vd} = {{D3, D1}, {D2, D0}}, f64
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from i16 to float32

VCVHFU Vt, VVd
Vt = {T3, T2, T1, T0}, u16
Vd = {Vd', Vd} = {{D3, D1}, {D2, D0}}, f64
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from u16 to float32

VCVIH Vt, Vs, Vd
Vt = {T1, T0}, i32
Vs = {S1, S0}, i32
Vd = {D3, D2, D1, D0}, f16
D3 = S1, D2 = T1, D1 = S0, D0 = T0
Conversion from i32 to float16

VCVIHU Vt, Vs, Vd
Vt = {T1, T0}, u32
Vs = {S1, S0}, u32
Vd = {D3, D2, D1, D0}, f16
D3 = S1, D2 = T1, D1 = S0, D0 = T0
Conversion from u32 to float16

VCVLINHF Vt, Vd
Vt = {T3, T2, T1, T0}, i16
Vd = {Vd', Vd} = {{D3, D2}, {D1, D0}}, f64
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from i16 to float32

VCVLINHFU Vt, VVd
Vt = {T3, T2, T1, T0}, u16
Vd = {Vd', Vd} = {{D3, D2}, {D1, D0}}, f64
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from u16 to float32

VCVHH Vt, Vd
Vt = {T3, T2, T1, T0}, i16
Vd = {D3, D2, D1, D0}, f16
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from i16 to float16

VCVHHU Vt, Vd
Vt = {T3, T2, T1, T0}, u16
Vd = {D3, D2, D1, D0}, f16
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from u16 to float16

Vector type conversion operations FLT->INT

VHCVH Vt, Vd
VHCVH.floor Vt, Vd
VHCVH.round Vt, Vd
VHCVH.ceil Vt, Vd
VHCVH.trunc Vt, Vd
Vt = {T3, T2, T1, T0}, f16
Vd = {D3, D2, D1, D0}, i16
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from float16 to i16.
Optional rounding, forced saturation.

VHCVHU Vt, Vd
VHCVHU.floor Vt, Vd
VHCVHU.round Vt, Vd
VHCVHU.ceil Vt, Vd
VHCVHU.trunc Vt, Vd
Vt = {T3, T2, T1, T0}, f16
Vd = {D3, D2, D1, D0}, u16
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from float16 to u16.
Optional rounding, forced saturation.

VHCVI Vt, VVd
Vt = {T3, T2, T1, T0}, f16
VVd = {Vd', Vd} = {{D3, D1}, {D2, D0}}, i32
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from float16 to i32, forced saturation.

VHCVIU Vt, VVd
Vt = {T3, T2, T1, T0}, f16
VVd = {Vd', Vd} = {{D3, D1}, {D2, D0}}, u32
D3 = T3, D2 = T2, D1 = T1, D0 = T0
Conversion from float16 to u32, forced saturation.

VFCVD Vt, VVd
Vt = {T1, T0}, f32
VVd = {Vd', Vd} = {{D1}, {D0}}, i64
D1 = T1, D0 = T0
Conversion from float32 to i64, forced saturation.

VFCVDU Vt, VVd
Vt = {T1, T0}, f32
VVd = {Vd', Vd} = {{D1}, {D0}}, u64
D1 = T1, D0 = T0
Conversion from float32 to u64, forced saturation.

VFCVI Vt, Vd
VFCVI.floor Vt, Vd
VFCVI.round Vt, Vd
VFCVI.ceil Vt, Vd
VFCVI.trunc Vt, Vd
Vt = {T1, T0}, f32
VVd = {D1, D0}, i32
D1 = T1, D0 = T0
Conversion from float32 to i32.
Optional rounding, forced saturation.

VFCVIU Vt, Vd
VFCVIU.floor Vt, Vd
VFCVIU.round Vt, Vd
VFCVIU.ceil Vt, Vd
VFCVIU.trunc Vt, Vd
Vt = {T1, T0}, f32
VVd = {D1, D0}, u32
D1 = T1, D0 = T0
Conversion from float32 to u32.
Optional rounding, forced saturation.

VDCVD Vt, Vd
VDCVD.floor Vt, Vd
VDCVD.round Vt, Vd
VDCVD.ceil Vt, Vd
VDCVD.trunc Vt, Vd
Vt = {T0}, f64
VVd = {D0}, i64
D0 = T0
Conversion from float64 to i64.
Optional rounding, forced saturation.

VDCVDU Vt, Vd
VDCVDU.floor Vt, Vd
VDCVDU.round Vt, Vd
VDCVDU.ceil Vt, Vd
VDCVDU.trunc Vt, Vd
Vt = {T0}, f64
VVd = {D0}, u64
D0 = T0
Conversion from float64 to u64.
Optional rounding, forced saturation.

VDCVI Vt, Vs, Vd
Vt = {T0}, f64
Vs = {S0}, f64
VVd = {D1, D0}, i32
D1 = S0, D0 = T0
Conversion from float64 to i32, forced saturation.

VDCVIU Vt, Vs, Vd
Vt = {T0}, f64
Vs = {S0}, f64
VVd = {D1, D0}, u32
D1 = S0, D0 = T0
Conversion from float64 to u32, forced saturation.

VFCVH Vt, Vs, Vd
VFCVH.round Vt, Vs, Vd
VFCVH.trunc Vt, Vs, Vd
VFCVH.ceil Vt, Vs, Vd
VFCVH.floor Vt, Vs, Vd
Vt = {T1, T0}, f32
Vs = {S1, S0}, f32
Vd = {D3, D2, D1, D0}, i16
D3 = S1, D2 = T1, D1 = S0, D0 = T0
Conversion from float32 to i16.
Optional rounding, forced saturation.
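
The widening rows (even or odd source elements, sign or zero extension) and the saturating narrowing rows of Table 14 can be illustrated with the following Python sketch. It is a hedged model and not the patented implementation: the function names are hypothetical, list index 0 stands for element 0, and the interleaving order assumes that the braces in the table list the highest-numbered element first, so that Vd = {sat16(S1), sat16(T1), sat16(S0), sat16(T0)} means D0 = sat16(T0), D1 = sat16(S0), and so on.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def zext(value, bits):
    """Zero-extend the low `bits` bits of value."""
    return value & ((1 << bits) - 1)


def widen(src, bits, odd=False, signed=True):
    """D[i] = ext(S[2i + odd]): widen the even or the odd source elements."""
    ext = sext if signed else zext
    return [ext(x, bits) for x in src[int(odd)::2]]


def sat(value, bits, signed=True):
    """Clamp value to the range of a signed or unsigned `bits`-wide integer."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))


def narrow_interleaved(t, s, bits, signed=True):
    """Saturate and interleave: D[2i] = sat(T[i]), D[2i+1] = sat(S[i])."""
    out = []
    for ti, si in zip(t, s):
        out += [sat(ti, bits, signed), sat(si, bits, signed)]
    return out


raw_bytes = [0xFF, 0x01, 0x80, 0x7F, 0, 0, 0, 0]
even_signed = widen(raw_bytes, 8)                            # VCVBHE-style
odd_unsigned = widen(raw_bytes, 8, odd=True, signed=False)   # VCVBHOU-style
packed = narrow_interleaved([70000, -5], [1, -70000], 16)    # VSATLH-style
print(even_signed, odd_unsigned, packed)
```

Taking two source registers in the narrowing form keeps the destination full: two vectors of wide elements pack into one vector of elements half as wide.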

Table 15.

Command mnemonic / Description

VREDD Vt, Vd
Running integer sum across sections. Under conditional execution, only the destination-register elements selected by the predicate are written, while the summation still proceeds sequentially over all elements.
Vt = {T[i].D}, Vd = {D[i].D}
D[i].D = sum(T[0].D, …, T[i].D), i=0…7

VREDRD Vt, Rd.D
Cross-section reduction by integer addition, with the result placed in a scalar register. Under conditional execution, only the elements selected by the predicate are added; the rest are ignored. If the condition is zero, the instruction still executes and zero is written to the result.
Vt = {T[i].D}, i=0…7
Rd.D = SUM(T[i])

VANDREDD Vt, Vd.D
VANDREDRD Vt, Rd.D
Cross-section reduction by logical AND, with the result replicated across the vector register or placed in a scalar register. Under conditional execution, only the elements selected by the predicate take part in the reduction; the rest are ignored. If the condition is zero, the scalar instruction still executes and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = AND(T[i])
Vector variant:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar variant:
Rd.D = R

VORREDD Vt, Vd.D
VORREDRD Vt, Rd.D
Cross-section reduction by logical OR, with the result replicated across the vector register or placed in a scalar register. Under conditional execution, only the elements selected by the predicate take part in the reduction; the rest are ignored. If the condition is zero, the scalar instruction still executes and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = OR(T[i])
Vector variant:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar variant:
Rd.D = R

VEORREDD Vt, Vd.D
VEORREDRD Vt, Rd.D
Cross-section reduction by logical exclusive OR, with the result replicated across the vector register or placed in a scalar register. Under conditional execution, only the elements selected by the predicate take part in the reduction; the rest are ignored. If the condition is zero, the scalar instruction still executes and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = EOR(T[i])
Vector variant:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar variant:
Rd.D = R

VADDREDD Vt, Vd.D
VADDREDRD Vt, Rd.D
Cross-section reduction by integer addition, with the result replicated across the vector register or placed in a scalar register. Under conditional execution, only the elements selected by the predicate are added; the rest are ignored. If the condition is zero, the scalar instruction still executes and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = SUM(T[i])
Vector variant:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar variant:
Rd.D = R

VMAXREDD Vt, Vd.D
VMAXREDRD Vt, Rd.DVMAXREDD Vt, Vd.D
VMAXREDRD Vt, Rd.D Операция межсекционной редукции по целочисленному знаковому максимуму с размножением результата по векторному регистру или помещением результата в скалярный регистр. При условном исполнении складываться должны только выделенные предикатом элементы, остальные должны быть проигнорированы. При нулевом условии скалярная команда исполняется, в результат записывается ноль.
Vt = {T[i].D}, i=0…7
R = MAX(T[i])
Векторный вариант:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Скалярный вариант:
Rd.D = RThe operation of intersectional reduction by an integer signed maximum with the result multiplied by a vector register or by placing the result in a scalar register. In conditional execution, only the elements selected by the predicate should be added, the rest should be ignored. If the condition is zero, the scalar instruction is executed and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = MAX(T[i])
Vector option:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar option:
Rd.D = R VMINREDD Vt, Vd.D
VMINREDRD Vt, Rd.DVMINREDD Vt, Vd.D
VMINREDRD Vt, Rd.D Операция межсекционной редукции по целочисленному знаковому минимуму с размножением результата по векторному регистру или помещением результата в скалярный регистр. При условном исполнении складываться должны только выделенные предикатом элементы, остальные должны быть проигнорированы. При нулевом условии скалярная команда исполняется, в результат записывается ноль.
Vt = {T[i].D}, i=0…7
R = MIN(T[i])
Векторный вариант:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Скалярный вариант:
Rd.D = RThe operation of intersectional reduction by an integer signed minimum with the result multiplied by a vector register or by placing the result in a scalar register. In conditional execution, only the elements selected by the predicate should be added, the rest should be ignored. If the condition is zero, the scalar instruction is executed and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = MIN(T[i])
Vector option:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar option:
Rd.D = R

VMAXREDDU Vt, Vd.D
VMAXREDRDU Vt, Rd.D
Inter-section reduction by integer unsigned maximum, with the result replicated across the vector register or placed in a scalar register. In conditional execution, only the elements selected by the predicate take part in the reduction; the rest are ignored. If the condition is zero, the scalar instruction is still executed and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = MAX(T[i])
Vector option:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar option:
Rd.D = R

VMINREDDU Vt, Vd.D
VMINREDRDU Vt, Rd.D
Inter-section reduction by integer unsigned minimum, with the result replicated across the vector register or placed in a scalar register. In conditional execution, only the elements selected by the predicate take part in the reduction; the rest are ignored. If the condition is zero, the scalar instruction is still executed and zero is written to the result.
Vt = {T[i].D}, i=0…7
R = MINU(T[i])
Vector option:
Vd = {D[i]}, D[i].D = R, i64, i=0…7
Scalar option:
Rd.D = R

VFADDRED Vt, Vd.L
VFADDREDR Vt, Rd.L
Inter-section reduction by floating-point addition, with the result replicated across the vector register or placed in a scalar register. Only the lowest (least significant) element of each vector section is used.
In conditional execution, only the predicate-selected lowest elements of each vector section are fed to the input, while all predicate-selected output elements (not only the lowest ones) are written. If the condition is zero, the scalar instruction is still executed and zero is written to the result.
Vt = {T[i].L}, i=0…15
R = FSUM(T[j*2]), j=0…7
Vector option:
Vd = {D[i]}, D[i].L = R, f32, i=0…15
Scalar option:
Rd.L = R

VFMAXRED Vt, Vd.L
VFMAXREDR Vt, Rd.L
Inter-section reduction by floating-point maximum, with the result replicated across the vector register or placed in a scalar register. Only the lowest (least significant) element of each vector section is used.
In conditional execution, only the predicate-selected lowest elements of each vector section are fed to the input, while all predicate-selected output elements (not only the lowest ones) are written. If the condition is zero, the scalar instruction is still executed and zero is written to the result.
Vt = {T[i].L}, i=0…15
R = FMAX(T[j*2]), j=0…7
Vector option:
Vd = {D[i]}, D[i].L = R, f32, i=0…15
Scalar option:
Rd.L = R

VFMINRED Vt, Vd.L
VFMINREDR Vt, Rd.L
Inter-section reduction by floating-point minimum, with the result replicated across the vector register or placed in a scalar register. Only the lowest (least significant) element of each vector section is used.
In conditional execution, only the predicate-selected lowest elements of each vector section are fed to the input, while all predicate-selected output elements (not only the lowest ones) are written. If the condition is zero, the scalar instruction is still executed and zero is written to the result.
Vt = {T[i].L}, i=0…15
R = FMIN(T[j*2]), j=0…7
Vector option:
Vd = {D[i]}, D[i].L = R, f32, i=0…15
Scalar option:
Rd.L = R

Table 16.
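The reduction semantics tabulated above can be modelled in a few lines of Python. This is only an illustrative sketch: the function names, the use of Python lists for registers, and the representation of the predicate as a list of booleans are assumptions made for the example and are not taken from the patent. Python floats are 64-bit, so the f32 rounding of the real commands is not reproduced.

```python
from typing import List, Optional, Sequence


def vminredrd(vt: Sequence[int], pred: Optional[Sequence[bool]] = None) -> int:
    """Scalar form of the signed-minimum reduction: R = MIN(T[i]), i = 0..7.

    Only predicate-selected elements take part. If no element is selected,
    the result is zero, matching the "zero condition" rule in the table.
    """
    if pred is None:
        pred = [True] * len(vt)
    selected = [t for t, p in zip(vt, pred) if p]
    return min(selected) if selected else 0


def vminredd(vt: Sequence[int]) -> List[int]:
    """Vector form (unpredicated): replicate R into every element of Vd."""
    return [vminredrd(vt)] * len(vt)


def vfaddredr(vt: Sequence[float], pred: Optional[Sequence[bool]] = None) -> float:
    """Scalar form of the floating-point sum: R = FSUM(T[2*j]), j = 0..7.

    Vt holds 16 elements, two per 64-bit section; only the lowest element
    of each section (even index) is fed to the adder.
    """
    if pred is None:
        pred = [True] * len(vt)
    return sum((vt[j] for j in range(0, len(vt), 2) if pred[j]), 0.0)


if __name__ == "__main__":
    print(vminredrd([5, -3, 7, 0, 9, -8, 2, 1]))  # prints -8
```

The vector form is shown without a predicate because the table does not spell out which destination elements are written for the integer variants under a partial predicate.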

Command mnemonic    Description

VSHUFBH Vt, Vs, Vd
Vt = {T63, …, T2, T1, T0}, i8
Vs = {S63, …, S2, S1, S0}, i8
Vdi = {R31, …, R2, R1, R0}, u16
Vdo = {D31, …, D2, D1, D0}, i16
VV = {S63, …, S0, T63, …, T0}
D[i] = sext_8→16(VV[R[i] & 0x7F]), i=0,1,…,31
Universal inter-section selection operation with data type extension.

VSHUFBL Vt, Vs, Vd
Vt = {T63, …, T2, T1, T0}, i8
Vs = {S63, …, S2, S1, S0}, i8
Vdi = {R15, …, R2, R1, R0}, u32
Vdo = {D15, …, D2, D1, D0}, i32
VV = {S63, …, S0, T63, …, T0}
D[i] = sext_8→32(VV[R[i] & 0x7F]), i=0,1,…,15
Universal inter-section selection operation with data type extension.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFHL Vt, Vs, Vd
Vt = {T31, …, T2, T1, T0}, i16
Vs = {S31, …, S2, S1, S0}, i16
Vdi = {R15, …, R2, R1, R0}, u32
Vdo = {D15, …, D2, D1, D0}, i32
VV = {S31, …, S0, T31, …, T0}
D[i] = sext_16→32(VV[R[i] & 0x3F]), i=0,1,…,15
Universal inter-section selection operation with data type extension.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFBHU Vt, Vs, Vd
Vt = {T63, …, T2, T1, T0}, u8
Vs = {S63, …, S2, S1, S0}, u8
Vdi = {R31, …, R2, R1, R0}, u16
Vdo = {D31, …, D2, D1, D0}, u16
VV = {S63, …, S0, T63, …, T0}
D[i] = zext_8→16(VV[R[i] & 0x7F]), i=0,1,…,31
Universal inter-section selection operation with data type extension.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFBLU Vt, Vs, Vd
Vt = {T63, …, T2, T1, T0}, u8
Vs = {S63, …, S2, S1, S0}, u8
Vdi = {R15, …, R2, R1, R0}, u32
Vdo = {D15, …, D2, D1, D0}, u32
VV = {S63, …, S0, T63, …, T0}
D[i] = zext_8→32(VV[R[i] & 0x7F]), i=0,1,…,15
Universal inter-section selection operation with data type extension.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFHLU Vt, Vs, Vd
Vt = {T31, …, T2, T1, T0}, u16
Vs = {S31, …, S2, S1, S0}, u16
Vdi = {R15, …, R2, R1, R0}, u32
Vdo = {D15, …, D2, D1, D0}, u32
VV = {S31, …, S0, T31, …, T0}
D[i] = zext_16→32(VV[R[i] & 0x3F]), i=0,1,…,15
Universal inter-section selection operation with data type extension.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFHB Vt, Vs, Vd
Vt = {T31, …, T2, T1, T0}, i16
Vs = {S31, …, S2, S1, S0}, i16
Vdi = {R63, …, R2, R1, R0}, u8
Vdo = {D63, …, D2, D1, D0}, i8
VV = {S31, …, S0, T31, …, T0}
D[i] = sat_i16→i8(VV[R[i] & 0x3F]), i=0,1,…,63
Universal inter-section selection operation with forced saturation (signed → signed) to a smaller data type.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFLB Vt, Vs, Vd
Vt = {T15, …, T2, T1, T0}, i32
Vs = {S15, …, S2, S1, S0}, i32
Vdi = {R63, …, R2, R1, R0}, u8
Vdo = {D63, …, D2, D1, D0}, i8
VV = {S15, …, S0, T15, …, T0}
D[i] = sat_i32→i8(VV[R[i] & 0x1F]), i=0,1,…,63
Universal inter-section selection operation with forced saturation (signed → signed) to a smaller data type.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFLH Vt, Vs, Vd
Vt = {T15, …, T2, T1, T0}, i32
Vs = {S15, …, S2, S1, S0}, i32
Vdi = {R31, …, R2, R1, R0}, u16
Vdo = {D31, …, D2, D1, D0}, i16
VV = {S15, …, S0, T15, …, T0}
D[i] = sat_i32→i16(VV[R[i] & 0x1F]), i=0,1,…,31
Universal inter-section selection operation with forced saturation (signed → signed) to a smaller data type.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFHBU Vt, Vs, Vd
Vt = {T31, …, T2, T1, T0}, u16
Vs = {S31, …, S2, S1, S0}, u16
Vdi = {R63, …, R2, R1, R0}, u8
Vdo = {D63, …, D2, D1, D0}, u8
VV = {S31, …, S0, T31, …, T0}
D[i] = sat_i16→u8(VV[R[i] & 0x3F]), i=0,1,…,63
Universal inter-section selection operation with forced saturation (signed → unsigned) to a smaller data type.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFLBU Vt, Vs, Vd
Vt = {T15, …, T2, T1, T0}, u32
Vs = {S15, …, S2, S1, S0}, u32
Vdi = {R63, …, R2, R1, R0}, u8
Vdo = {D63, …, D2, D1, D0}, u8
VV = {S15, …, S0, T15, …, T0}
D[i] = sat_i32→u8(VV[R[i] & 0x1F]), i=0,1,…,63
Universal inter-section selection operation with forced saturation (signed → unsigned) to a smaller data type.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

VSHUFLHU Vt, Vs, Vd
Vt = {T15, …, T2, T1, T0}, u32
Vs = {S15, …, S2, S1, S0}, u32
Vdi = {R31, …, R2, R1, R0}, u16
Vdo = {D31, …, D2, D1, D0}, u16
VV = {S15, …, S0, T15, …, T0}
D[i] = sat_i32→u16(VV[R[i] & 0x1F]), i=0,1,…,31
Universal inter-section selection operation with forced saturation (signed → unsigned) to a smaller data type.
The previous values of the Vd register are used as indices, and the result is written back to the same register.

Table 17.
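The shuffle formulas above can be read as a table lookup into the concatenation of the two source registers, followed by a width conversion. The Python sketch below models one extending variant (VSHUFBH) and one saturating variant (VSHUFHB). It rests on stated assumptions: registers are Python lists with index 0 holding element 0, VV is taken to place the T elements at indices 0 to 63 (or 0 to 31) and the S elements above them, VSHUFBH receives raw 8-bit patterns (0 to 255), and VSHUFHB receives already-signed integers. The helper names are hypothetical.

```python
from typing import List, Sequence


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def saturate(value: int, lo: int, hi: int) -> int:
    """Clamp value to the closed range [lo, hi]."""
    return max(lo, min(hi, value))


def vshufbh(vt: Sequence[int], vs: Sequence[int], vd: Sequence[int]) -> List[int]:
    """D[i] = sext_8->16(VV[R[i] & 0x7F]) for 32 indices R[i] taken from Vd."""
    vv = list(vt) + list(vs)  # 64 + 64 byte elements, T in the low half
    return [sign_extend(vv[r & 0x7F], 8) for r in vd]


def vshufhb(vt: Sequence[int], vs: Sequence[int], vd: Sequence[int]) -> List[int]:
    """D[i] = sat_i16->i8(VV[R[i] & 0x3F]) for 64 indices R[i] taken from Vd."""
    vv = list(vt) + list(vs)  # 32 + 32 halfword elements, T in the low half
    return [saturate(vv[r & 0x3F], -128, 127) for r in vd]
```

Because the indices come from the old contents of Vd and the result overwrites Vd, a caller modelling a real sequence would pass the current Vd list and store the returned list back into it.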

Command mnemonic    Description

VLUT0 Vt, Vs, Vr, Vd
VLUT1 Vt, Vs, Vr, Vd
VLUTa T, S, R, D
Vt = {T63, …, T2, T1, T0}, u8
Vs = {S63, …, S2, S1, S0}, u8
Vr = {R63, …, R2, R1, R0}, u8
Vd = {D63, …, D2, D1, D0}, u8
VV = {R63, …, R0, S63, …, S0}, u8
if (T[i][7] == a) then D[i] = VV[T[i] & 0x7F], i=0,1,…,63
a is an integer 1-bit parameter that can take the values 0 and 1;
T, S, R are the input 512-bit vectors;
D is the output 512-bit vector.
The input and output vectors are each divided into 64 bytes. The vector VV = {Vr, Vs} contains a lookup table or, more precisely, half of it: the complete lookup table is held in two vectors VV, corresponding to the two values of the parameter a. The vector T contains the data to be converted.
The table entry for each byte ti of the vector T (i = 0,1,…,63) is selected by its seven least significant bits according to the formula: di = VV[ti[6:0]].
A feature of the VLUT command is that, in addition to the output vector itself, it also forms a per-byte write mask for this vector according to the formula: mi = (ti[7] == a). Thus, only data from the required table is written to the output register D.

Table 18.
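The two-pass use of VLUT described above (one pass per table half, with the write mask keeping only the bytes that belong to that half) can be sketched in Python as follows. The function names, the list representation of registers, and the placement of S at indices 0 to 63 and R at indices 64 to 127 of VV are assumptions made for illustration; `lut256` is a hypothetical helper showing how a full 256-entry table might be applied.

```python
from typing import List, Sequence


def vlut(a: int, vt: Sequence[int], vs: Sequence[int],
         vr: Sequence[int], vd: Sequence[int]) -> List[int]:
    """Model of VLUTa: masked lookup into the 128-byte half table VV = {Vr, Vs}.

    For each byte t of Vt, bit 7 selects the table half. Only bytes whose
    bit 7 equals `a` are written; all other bytes of Vd keep their old value.
    """
    vv = list(vs) + list(vr)  # S in the low 64 entries, R in the high 64
    return [vv[t & 0x7F] if (t >> 7) == a else d for t, d in zip(vt, vd)]


def lut256(table: Sequence[int], data: Sequence[int]) -> List[int]:
    """Apply a full 256-entry byte table to 64 bytes using two VLUT passes."""
    d = [0] * len(data)
    d = vlut(0, data, table[0:64], table[64:128], d)     # entries 0..127
    d = vlut(1, data, table[128:192], table[192:256], d)  # entries 128..255
    return d
```

Since the two write masks are complementary, the order of the two passes does not affect the result in this model.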

Command mnemonic    Description

VHIST0 Vs, Vd
VHIST1 Vs, Vd
VHIST2 Vs, Vd
VHIST3 Vs, Vd
VHISTa S, D
a is an integer 2-bit parameter that can take the values 0, 1, 2, 3;
S is the input 512-bit vector;
D is the output 512-bit vector.
The input and output vectors are each divided into 64 bytes. For each i (i = 0,1,…,63), the number of input bytes whose value equals (64*a + i) is counted, and this count is written to the i-th byte of the output vector.
To compute a 256-level histogram, this command must be executed for all four values of the parameter a. The resulting 64-component byte vectors are then extended to 16-bit or 32-bit unsigned integers, depending on the format in which the histogram bins are accumulated, and added to the accumulated bins.
Thus, computing histograms also requires the vector type extension and integer addition commands. With compact code, a 256-level histogram for 64 pixels can be computed in about 6 cycles.
The four values of the parameter a can be implemented as four different commands.

VHISTC0 Vs, VAd
VHISTC1 Vs, VAd
VHISTC2 Vs, VAd
VHISTC3 Vs, VAd
The command is similar to VHIST*, except that the result is accumulated in accumulators. Eight 32-bit unsigned accumulators are used in each vector section. Four accumulators are required to count all input values.

Table 19.
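The histogram procedure described above (four VHIST passes per 64-byte block, widening, then accumulation) can be sketched in Python. The names `vhist` and `histogram256` and the list-based register model are assumptions for illustration only; the widening step is implicit here because Python integers do not overflow.

```python
from typing import List, Sequence


def vhist(a: int, vs: Sequence[int]) -> List[int]:
    """Model of VHISTa: D[i] = number of input bytes equal to 64*a + i, i = 0..63."""
    return [sum(1 for b in vs if b == 64 * a + i) for i in range(64)]


def histogram256(pixels: Sequence[int]) -> List[int]:
    """Accumulate a 256-bin histogram by processing the input in 64-byte blocks."""
    bins = [0] * 256
    for start in range(0, len(pixels), 64):
        block = pixels[start:start + 64]
        for a in range(4):                  # one VHIST pass per quarter of the range
            counts = vhist(a, block)        # each count fits in a byte (at most 64)
            for i, count in enumerate(counts):
                bins[64 * a + i] += count   # widened accumulation
    return bins
```

Each per-block count is at most 64, which is why a byte per bin is sufficient for the VHIST output before widening.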

Command mnemonic    Description

VPUSHD Vt, Vs, Vd
Vt = {T7, …, T2, T1, T0}, u64
Vs = {S7, …, S2, S1, S0}, u64
Vd = {S6, S5, S4, S3, S2, S1, S0, T7}
Ring inter-section shift by 8 bytes. The extracted element S7 is lost.
In conditional execution, the command is executed in the normal mode, but only the active elements are written, according to the predicate condition.

VPUSHRD Rt.D, Vs, Vd
Vs = {S7, …, S2, S1, S0}, u64
Vd = {S6, S5, S4, S3, S2, S1, S0, Rt}
Rt.D = S7
Ring inter-section shift by 8 bytes, with the scalar register transferred into the zero section. The extracted element is returned in Rt.D.

Table 20.
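The two ring-shift commands above amount to a one-element shift towards the higher sections, with the element entering section 0 coming either from another vector register or from a scalar register. A minimal Python sketch follows, assuming registers are lists of eight 64-bit values with index 0 as section 0; the function names and the tuple return of `vpushrd` are illustrative choices, not part of the patent.

```python
from typing import List, Sequence, Tuple


def vpushd(vt: Sequence[int], vs: Sequence[int]) -> List[int]:
    """Model of VPUSHD: Vd = {S6, ..., S0, T7}.

    Section 0 receives T7, section k receives S[k-1]; S7 is dropped.
    """
    return [vt[7]] + list(vs[:7])


def vpushrd(rt: int, vs: Sequence[int]) -> Tuple[List[int], int]:
    """Model of VPUSHRD: Vd = {S6, ..., S0, Rt}, and the new Rt is S7.

    Returns the new vector register and the new value of the scalar register.
    """
    return [rt] + list(vs[:7]), vs[7]
```

In this model, applying `vpushrd` eight times in a row passes every original element of the vector through the scalar register exactly once, highest section first.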

Claims

1. A scalar-vector processor 100 comprising scalar and vector data processing channels 105 and 107, which are joined by a ring bus CDB 112 and connected to a reduction unit VRED 104 and to a first-level data memory DMEM/L1D$ 103, which is connected to a second-level cache memory L2$ 101, which is connected to an external interface of the processor having access to the external memory of the computing system and is also connected to a first-level program memory PMEM/L1I$ 102, the output of which is connected to the input of a command fetch unit FETCH 109, the output of which is connected to the input of a command decoding unit DECODE 110, the first output of which is connected to the inputs of the scalar and vector channels and the second output of which is connected to a program control unit PCTRL 111, the output of which is connected to the input of the first-level program memory PMEM/L1I$ 102, wherein

- the first-level program memory PMEM/L1I$ 102 and the first-level data memory DMEM/L1D$ 103 are configured to generate access requests and transfer them to

- the second-level cache memory L2$ 101, which is configured to serve access requests from the first-level program memory PMEM/L1I$ 102 and the first-level data memory DMEM/L1D$ 103, as well as to load data via the external interface from the external memory of the computing system and to transfer the data to the first-level data memory DMEM/L1D$ 103 and the first-level program memory PMEM/L1I$ 102;

- the command fetch unit FETCH 109 is configured to fetch commands from the program memory PMEM/L1I$ 102 and transfer them to

- the command decoding unit DECODE 110, which is configured to decode commands, generate program control commands for the execution units of the processor, and transfer them to

- the program control unit PCTRL 111, which is configured to execute program control commands.

2. The processor according to claim 1, characterized in that the first-level data memory DMEM/L1D$ 103 is implemented as a first-level cache memory L1D$ or as a tightly coupled static memory DMEM (TCM, Tightly-Coupled Memory).

3. The processor according to claim 1, characterized in that the first-level program memory PMEM/L1I$ 102 is implemented as a first-level cache memory L1I$ or as a tightly coupled static memory PMEM (TCM, Tightly-Coupled Memory).

4. The processor according to claim 1, characterized in that at the level of the computing core it has a Harvard architecture with the possibility of simultaneous access to the first-level program memory PMEM/L1I$ 102 and the first-level data memory DMEM/L1D$ 103 via separate buses.

5. The processor according to claim 1, characterized in that the second-level cache memory L2$ 101 has a von Neumann architecture.

6. The processor according to claim 1, characterized in that the program control commands are selected from a set of commands containing program jump commands and program loop commands.

7. The processor according to claim 1, characterized in that the commands are combined into instructions, which are organized in the form of a VLIW package 201 (VLIW - Very Long Instruction Word).

8. The processor according to claim 7, characterized in that the VLIW package 201 contains up to eight instructions, of which up to four instructions are for the execution units of the scalar data processing channel and up to four instructions are for the execution units of the vector data processing channel.

9. The processor according to claim 7, characterized in that the VLIW package 201 contains up to two scalar instructions and up to two vector instructions for data exchange with the data memory DMEM/L1D$ 103.

10. The processor according to claim 1, characterized in that it has a command system consisting of program control commands, commands of the execution units of the scalar data processing channel and of the vector data processing channel, as well as commands of the reduction unit VRED 104.

11. The processor according to claim 1, characterized in that the scalar channel 105 contains one scalar computing section 106.

12. The processor according to claim 1, characterized in that the scalar computing section 106 contains a multi-port scalar register file RF 301, which stores the processed scalar data.

13. The processor according to claim 12, characterized in that the scalar register file RF 301 contains ports associated with the scalar data processing channel 105 and configured to communicate with the data memory DMEM/L1D$ 103.

14. The processor according to claim 12, characterized in that the scalar register file RF 301 contains ports associated with the execution units of the scalar computing section 106 of the scalar data processing channel 105 and configured to supply source data for computational operations and to write the results of the operations back to the scalar register file RF 301.

15. The processor according to claim 1, characterized in that the scalar computing section 106 contains data processing units SLSE0 310 and SLSE1 311, which are configured to provide data exchange between the data memory DMEM/L1D$ 103 and the scalar register file RF 301, including the execution of data transfer commands between the data memory DMEM/L1D$ 103 and the scalar register file RF 301.

16. The processor according to claim 1, characterized in that the scalar computing section 106 contains data processing units ALU0 302, ALU1 303, ALU2 304, ALU3 305, performing arithmetic and logical operations on fixed-point numbers.

17. The processor according to claim 1, characterized in that the scalar computing section 106 contains data processing units FALU0 306, FALU1 307, performing arithmetic and logical operations on floating point numbers.

18. The processor according to claim 1, characterized in that the scalar computing section 106 contains data processing units SMU0 308, SMU1 309, performing multiplication operations on fixed-point and floating-point numbers.

19. The processor according to claim 1, characterized in that the scalar computing section 106 contains a data processing unit SH 312 that performs logical and arithmetic shift operations.

20. The processor according to claim 1, characterized in that the scalar computing section 106 contains a data processing unit CONV 315 that performs data type conversion operations.

21. The processor according to claim 1, characterized in that the scalar computing section 106 contains a data processing unit DIV 313 that performs division operations.

22. The processor according to claim 1, characterized in that the scalar computing section 106 contains a data processing unit MF 314 that calculates transcendental mathematical functions.

23. The processor according to claim 1, characterized in that the vector channel 107 consists of several vector computing sections 108, the number of which corresponds to the width of the processed vector.

24. The processor according to claim 23, characterized in that the vector computing section 108 contains a vector register file VRF 401, which is multi-port and multi-format and which stores the processed vector data.

25. The processor according to claim 24, characterized in that the vector register file VRF 401 is multi-format, so that each 64-bit register 500 of the vector register file VRF 401 can store either one 64-bit value 501, or two 32-bit values 502, or four 16-bit values 503, or eight 8-bit values 504.

26. The processor according to claim 24, characterized in that the vector register file VRF 401 contains ports associated with the external interface of the vector channel 107 and configured to exchange data with the data memory DMEM/L1D$ 103.

27. The processor according to claim 24, characterized in that the vector register file VRF 401 contains ports associated with the execution units of the vector computing section 108 and configured to supply source data for computational operations and to write the results back.

28. The processor according to claim 24, characterized in that the vector register file VRF 401 is configured to work with various data formats.

29. The processor according to claim 23, characterized in that the vector computing section 108 contains blocks VLSE0 412, VLSE1 413, which are configured to provide data exchange between the data memory DMEM/L1D$ 103 and the vector register file VRF 401, including the execution of data transfer commands between the data memory DMEM/L1D$ 103 and the vector register file VRF 401.

30. The processor according to claim 23, characterized in that the vector computing section 108 contains blocks VALU0 403, VALU1 404, VALU2 405, VALU3 406, configured to perform arithmetic and logical operations on fixed-point numbers.

31. The processor according to claim 23, characterized in that the vector computing section 108 contains blocks VFALU0 407, VFALU1 408, configured to perform arithmetic and logical operations on floating point numbers.

32. The processor according to claim 23, characterized in that the vector computing section 108 contains blocks VMU0 409, VMU1 410, configured to perform multiplication and multiplication-accumulation operations on fixed and floating point numbers.

33. The processor according to claim 23, characterized in that the vector computing section 108 contains a vector register file of accumulator registers VAC 402, configured to store data obtained and used as a result of performing multiplication with accumulation operations performed by vector multiplier units VMU0 409, VMU1 410.

34. The processor according to claim 23, characterized in that the vector computing section 108 contains a VSH 411 block, configured to perform logical and arithmetic shift operations on vector operands.

35. The processor according to claim 23, characterized in that the vector computing section 108 contains a VCONV 414 block, configured to perform a data type conversion operation on vector operands.

36. The processor according to claim 1, characterized in that the reduction unit VRED 104 is configured to calculate reduction functions.

37. The processor according to claim 1, characterized in that the reduction unit VRED 104 is configured to calculate the reduction functions and, at the same time, to implement the interaction of the scalar and vector parts of the processor in operations in which the scalar channel 105 generates and/or consumes a scalar required and/or generated by the vector channel 107.

38. The processor according to claim 1, characterized in that the reduction unit VRED 104 contains a block RALU 601, configured to perform arithmetic-logical inter-section reduction operations.

39. The processor according to claim 1, characterized in that the reduction unit VRED 104 contains a block SHUFFLE 602, configured to perform inter-section permutations.

40. The processor according to claim 1, characterized in that the reduction unit VRED 104 contains a block LUT 603, configured to perform inter-section table transformations.

41. The processor according to claim 1, characterized in that the reduction unit VRED 104 contains a block HIST 604, configured to perform histogram calculation operations.

42. The processor according to claim 1, characterized in that the ring bus CDB (Circular Data Bus) 112 is configured to exchange data simultaneously with the execution of computational operations in the scalar and vector channels 105, 107 and in the reduction unit VRED 104.

43. The processor according to claim 1, characterized in that the ring bus CDB 112 is configured to implement cyclic shift commands, as a result of which the register Ri of the scalar register file RF 301 is shifted into the register Vj of the vector register file VRF 401 of the zero vector computing section 108: Vj.0=Ri; the register Vj of the vector register file VRF 401 of the highest, (N-1)-th, vector computing section 108 is shifted into the register Ri of the scalar register file RF 301: Ri=Vj.N-1; and the registers Vj of the vector register files VRF 401 of the remaining vector computing sections 108 are shifted by one section towards the higher sections: Vj.k=Vj.k-1, k=1,2,…,N-1.

44. The processor according to claim 1, characterized in that the ring bus CDB 112 is configured to sequentially move data from the vector channel 107 to the scalar channel 105 in order to perform operations that are only available in the scalar channel 105, with the subsequent return of the converted data to the vector channel 107.