
GB2634041A - Narrowing vector store instruction - Google Patents


Info

Publication number
GB2634041A
Authority
GB
United Kingdom
Prior art keywords
vector
instruction
circuitry
data elements
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB2314829.9A
Other versions
GB2634041B (en)
GB202314829D0 (en)
Inventor
Martinez Vicente Alejandro
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Ltd
Original Assignee
ARM Ltd
Advanced Risc Machines Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARM Ltd, Advanced Risc Machines Ltd filed Critical ARM Ltd
Priority to GB2314829.9A priority Critical patent/GB2634041B/en
Publication of GB202314829D0 publication Critical patent/GB202314829D0/en
Priority to PCT/GB2024/052127 priority patent/WO2025068671A1/en
Priority to TW113134370A priority patent/TW202514357A/en
Publication of GB2634041A publication Critical patent/GB2634041A/en
Application granted granted Critical
Publication of GB2634041B publication Critical patent/GB2634041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30038Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

An apparatus comprises instruction decoding circuitry to decode instructions and issue circuitry to issue at least one micro-operation to control processing circuitry to perform a processing operation. In response to decoding of a narrowing vector store instruction specifying a plurality of vector source registers, e.g. Z1 and Z2, each for specifying a vector operand, the instruction decoding circuitry is configured to control the issue circuitry to issue at least one micro-operation to control the processing circuitry to narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size, and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size.

Description

NARROWING VECTOR STORE INSTRUCTION
The present technique relates to the field of data processing.
Processing circuitry may support a vector processing architecture where vector instructions can trigger vector operations to be performed on vector operands comprising multiple data elements. Performing vector operations using vector instructions can reduce the instruction fetch and decode overhead of performing a given operation on each of a set of data elements, compared to a scalar implementation which processes each data element individually using a separate scalar instruction.
At least some examples provide an apparatus comprising: instruction decoding circuitry to decode instructions; and issue circuitry to issue, in response to decoding of a given instruction by the instruction decoding circuitry, at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, the instruction decoding circuitry is configured to control the issue circuitry to issue at least one micro-operation to control the processing circuitry to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
At least some examples provide computer-readable code for fabrication of an apparatus comprising: instruction decoding circuitry to decode instructions; and issue circuitry to issue, in response to decoding of a given instruction by the instruction decoding circuitry, at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, the instruction decoding circuitry is configured to control the issue circuitry to issue at least one micro-operation to control the processing circuitry to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
At least some examples provide a method comprising: decoding instructions; and in response to decoding of a given instruction, issuing at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, at least one micro-operation is issued to control the processing circuitry to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
At least some examples provide a computer program for controlling a host data processing apparatus to provide an instruction execution environment for execution of target program code, the computer program comprising: instruction decoding program logic to decode instructions of the target program code; and processing program logic to perform a processing operation corresponding to a given instruction decoded by the instruction decoding program logic; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, the instruction decoding program logic is configured to control the processing program logic to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a simulated address space, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the simulated address space corresponding to a target memory address determined based on the at least one address operand.
At least some examples provide a storage medium storing the computer-readable code or the computer program mentioned above. The storage medium may be a non-transitory storage medium.
Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings, in which: Figure 1 illustrates an example of a processing system supporting vector processing; Figure 2 illustrates an example of vector registers having a vector length indicated by a vector length parameter; Figure 3 illustrates a first example of instruction decoding circuitry and issue circuitry, in this case provided within a processor such as a Central Processing Unit (CPU); Figure 4 illustrates an example of use of a co-processor to process an offloaded subset of operations from a stream of operations to be performed by a main processor; Figure 5 illustrates functionality of both non-interleaving and interleaving variants of a narrowing vector store instruction; Figure 6 illustrates functionality of a non-interleaving variant of a narrowing vector store instruction with predication; Figure 7 illustrates a method of data processing; Figure 8 illustrates a method of processing a narrowing vector store instruction; Figure 9 illustrates a method of processing an interleaving variant of a narrowing vector store instruction; Figure 10 illustrates a method of processing a non-interleaving variant of a narrowing vector store instruction; Figure 11 illustrates a simulation example.
An apparatus comprises instruction decoding circuitry to decode instructions, and issue circuitry to issue, in response to decoding of a given instruction by the instruction decoding circuitry, at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction. Design of the instruction set supported by the instruction decoding circuitry can be a relatively complex task, because there may be a limited encoding space available in comparison with the wide range of operations which could theoretically be supported, and so to support a particular instruction, there would need to be a justification for providing it. The design choices and compromises made in which operations are supported and how those instructions are encoded can affect the performance and power consumption of the processing circuitry. Other factors to consider can be the practicality of implementing circuitry that executes the instruction, and the flexibility the architected instructions provide for supporting different micro-architectural design choices of circuit implementation. Therefore, merely because a given operation is theoretically possible, this does not automatically mean that its inclusion as a specific instruction would be desirable. Instruction set architecture designers are generally extremely cautious about adding new instructions to the instruction set, as once an instruction is included, it is extremely difficult to remove it as it would need to continue to be supported to allow legacy software using the instruction to continue to function. If an instruction is introduced which turns out to be problematic to implement in circuit hardware or which causes there to be insufficient encoding space to represent another more desirable operation, the consequences of that poorly thought out addition to the instruction set would be felt for a long time. Hence, care is taken when considering additions to the instruction set.
In the examples below, the instruction decoding circuitry supports a narrowing vector store instruction which specifies at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length. In response to decoding of the narrowing vector store instruction, the instruction decoding circuitry controls the issue circuitry to issue at least one micro-operation to control the processing circuitry to: * narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and * store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
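The narrow-then-store behaviour described above can be illustrated with a minimal reference model. This is a sketch under assumed parameters (truncation as the narrowing operation, little-endian byte order, two 128-bit sources of 32-bit elements narrowed to 16-bit elements, register-by-register ordering), not the actual encoding or semantics of any particular architecture:

```python
# Minimal reference model of a narrowing vector store. All sizes, the
# truncating narrowing operation, and the byte order are illustrative
# assumptions, not taken from the patent's encoding.

def narrow_element(value, first_size_bits):
    """Narrow one element by truncation to the smaller element size."""
    return value & ((1 << first_size_bits) - 1)

def narrowing_vector_store(memory, addr, src_registers,
                           second_size_bits, first_size_bits):
    """Narrow elements of several source registers and store them to memory.

    src_registers: list of lists of integer elements (second_size_bits wide).
    Elements are stored register by register (non-interleaving order).
    """
    out_bytes = bytearray()
    for reg in src_registers:
        for element in reg:
            narrowed = narrow_element(element, first_size_bits)
            out_bytes += narrowed.to_bytes(first_size_bits // 8, "little")
    memory[addr:addr + len(out_bytes)] = out_bytes
    return memory

mem = bytearray(32)
z1 = [0x00011111, 0x00022222, 0x00033333, 0x00044444]  # 4 x 32-bit elements
z2 = [0x00055555, 0x00066666, 0x00077777, 0x00088888]
narrowing_vector_store(mem, 0, [z1, z2], 32, 16)
# The 8 narrowed 16-bit elements occupy 16 bytes: the same amount of data as
# a single non-narrowed 128-bit vector store would write.
print(mem[:16].hex())  # → 11112222333344445555666677778888
```

Note that the whole operation consumes no intermediate architectural register: the narrowed data flows directly from the sources to memory, which is the register-pressure advantage discussed below.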
Hence, the narrowing vector store instruction causes both narrowing of vector elements of a plurality of source registers and storing of the narrowed vector elements from the plurality of source registers to memory to be performed in response to a single instruction.
This instruction can be helpful for workloads which process a data structure where the storage format of data elements stored in memory is compressed compared to the processing format in which the data elements are subject to processing operations at the processing circuitry. For example, maintaining vectors in vector registers with the second data element size can be useful for enabling higher precision to be retained during intermediate steps of vector processing operations performed on those data elements, while storing vectors having elements narrowed to the first data element size can provide a denser storage format sacrificing precision to reduce memory storage overhead for a given number of data elements. An example of a type of workload for which this is useful is an image processing workload where it can be very common for pixel data to be stored with a smaller data element size in memory, whilst having a wider data element size when loaded from memory for applying image processing kernels such as filtering, resizing or rotation.
Vector processing architectures can be attractive for this type of workload as they can apply a repetitive operation to multiple data elements in response to a single instruction.
However, the inventor has recognised that existing vector instruction sets can be inefficient for handling this type of workload.
The narrowing vector store instruction described herein can be more efficient than alternative instructions because, as the narrowing and store are both performed in response to a single architectural instruction, there is no need to consume additional architectural register(s) (registers represented by one of the limited set of architectural register identifiers that can be encoded in the instruction encoding) for holding one or more temporary narrowed vectors between the narrowing operation and the store (in contrast, such architectural register(s) would be consumed if a narrowing instruction and a store instruction were executed as separate architectural instructions with the narrowing instruction being a register-to-register instruction rather than a register-to-memory instruction). Therefore, the narrowing vector store instruction can help to reduce register pressure, which will tend to improve performance because code runs out of spare architectural registers less often, reducing the need to spill register data to memory when it could otherwise have been kept within the register file corresponding to architectural vector registers.
Also, providing an instruction which can store to memory at least one vector of narrowed data elements from two or more source vector registers can preserve the memory throughput (amount of data processed per store instruction) that would be achieved for an equivalent non-narrowing vector store instruction. In contrast, an alternative narrowing store instruction which stores to memory narrowed data elements from a single vector source register would have a reduced memory throughput compared to an equivalent non-narrowing vector store instruction. This is because for a narrowing store instruction the data element size is reduced, so if an amount of data in a vector source register comprised a given number of elements having the larger size, when the elements are narrowed to a smaller size and stored to memory then the amount of data stored to memory would be reduced compared to the amount of data in the vector source register. Hence, a narrowing store instruction acting on a single vector source register would have reduced memory throughput compared to a non-narrowing vector store instruction. However, storing narrowed data elements to memory from a plurality of vector source registers means that the total amount of data stored per store operation is not reduced compared to the non-narrowing vector store instruction because data is being stored from more than one vector source register, and therefore memory throughput can be preserved. Executing a greater number of store instructions per processed amount of memory data would incur a greater cost in terms of memory control bandwidth (e.g. increase the number of memory translation lookups, load/store unit slots consumed etc., which can restrict bandwidth available for other operations and hence limit performance). This cost can be reduced for a given amount of memory throughput by using the narrowing vector store instruction described above. 
Therefore, the narrowing vector store instruction which stores a vector of narrowed data elements to memory from a plurality of vector source registers can provide a number of advantages for supporting workloads which require data elements to be processed with a larger data element size than their storage format in memory, and so the extra encoding space consumed by providing this type of instruction in the instruction set can be justified.
The narrowing vector store instruction specifies two or more vector source registers. The vector source registers could be identified explicitly by two or more source register fields in the encoding of the narrowing vector store instruction. Alternatively, the instruction may explicitly identify a first source register of the two or more source registers in a register field of the instruction encoding, and one or more other source registers of the two or more vector source registers may implicitly be identified as the vector source registers whose architectural register identifiers are at predetermined offsets relative to the architectural register identifier of the first source register. For example, the two or more vector source registers may be the registers associated with a contiguous set of architectural register identifiers starting from the architectural register identifier of the first source register that is explicitly encoded in the instruction encoding. With this approach, the software developer or compiler is constrained to select registers with contiguous architectural register identifiers as the sources for the narrowing vector store instruction, but this constraint has the advantage of reducing the amount of encoding space needed for encoding the multiple source registers, which can be useful as instruction encoding space can be at a premium.
Regardless of the particular approach taken to encode the vector source registers in the instruction, each vector source register explicitly or implicitly identified by the instruction is associated with a different architectural register identifier, and so can separately be specified as a source or destination register by another instruction of the instruction set supported by the instruction decoding circuitry. That is, where a given vector instruction specifies a source/destination vector register field for specifying an identifier of a corresponding vector register used as a source/destination operand, each vector source register of the narrowing vector store instruction may have an associated architectural register identifier which corresponds to a different encoding of the source/destination register field of the given vector instruction. Hence, each source register of the narrowing vector store instruction could independently be specified as an operand for a vector operation or as a destination to be updated based on the result of the vector operation.
The narrowing vector store instruction could be mapped to one or more micro-operations by the instruction decoding circuitry. In some examples, the provision of a single narrowing vector store instruction can lead to a reduction in the number of micro-operations which are issued, compared to examples using more than one instruction. For example, one approach to store to memory narrowed vector data elements from a plurality of vector source registers is to first use a register-to-register zip instruction to narrow and combine data elements from a plurality of vector source registers and write the result to an architectural register, and then use a store instruction to retrieve the narrowed data from the destination register of the zip instruction and store said data to memory. The store instruction itself could be mapped to two micro-operations: a micro-operation to retrieve data from a source register, and a micro-operation to compute a destination address and perform the store on the data retrieved from the register in the first micro-operation. Hence, use of a zip instruction followed by a store instruction could involve the use of at least three micro-operations. In comparison, the instruction decoder may map the narrowing vector store instruction to fewer micro-operations, reducing the number of micro-operations which are required to perform the operation, and therefore reducing the overhead associated with performing the operation.
In particular, the at least one micro-operation may comprise a first micro-operation and a second micro-operation, in which the first micro-operation is to control the processing circuitry to form the at least one vector of narrowed data elements, and the second micro-operation is to control the processing circuitry to consume the at least one vector of narrowed data elements formed by the first micro-operation and store the consumed at least one vector of narrowed data elements to the location in the memory system. In some examples, the second micro-operation may also control the processing circuitry to determine the location in the memory system to which the at least one vector of narrowed data elements is to be stored, based on the at least one address operand. Because for the narrowing vector store instruction the narrowed data is not stored to an architectural register at an intermediate step between the narrowing and store to memory, the micro-operation for retrieving data from the architectural register may not need to be provided, and instead the micro-operation for storing data to memory can consume data directly from the micro-operation for narrowing the data elements. Hence, providing a single architectural instruction which stores narrowed data elements from a plurality of vector source registers to memory can lead to a reduction in issued micro-operations, and therefore an improvement in performance, compared to examples in which the operation is performed using more than one instruction.
The amount of data stored to memory in response to the narrowing vector store instruction is not particularly limited. However, in some examples the encoding of the narrowing vector store instruction permits the size of the at least one vector of narrowed elements, which is stored to memory, to be equal to the given vector length. It is not essential for every instance of the narrowing vector store instruction to store an amount of data of size greater than or equal to the given vector length, as in some cases of predication the predicate operand of the narrowing vector store instruction may select an active portion of data of size less than the given vector length. Nevertheless, the maximum amount of data capable of being stored by the narrowing vector store instruction may be greater than or equal to the given vector length. This is notably different from a narrowing vector store instruction which specifies a single vector source register, because in the case where there is a single vector register having the given vector length as the source, then when the data elements of that single vector register are narrowed the resulting data stored to memory will be less than the given vector length. In comparison, because more than one vector source register is specified by the present narrowing vector store instruction, then even after the elements are narrowed the total amount of data stored from the plurality of vector source registers can be equal in size or larger than the given vector length. This helps to increase the number of narrowed data elements stored to memory for a given amount of overhead associated with executing a store instruction.
Some examples may support a non-interleaving variant of the narrowing vector store instruction. In response to the non-interleaving variant of the narrowing vector store instruction, the instruction decoding circuitry may control the issue circuitry to issue the at least one micro-operation to control the processing circuitry to store to the location in memory the at least one vector of narrowed data elements wherein the narrowed data elements corresponding to a given one of the plurality of source vector registers are provided in a contiguous portion of the at least one vector of narrowed data elements without intervening narrowed data elements corresponding to any other of the plurality of source vector registers. Hence, the vector of narrowed data elements may comprise a plurality of separate portions, each one of the plurality of vector source registers corresponding to one portion, and each portion comprising the narrowed data elements of the corresponding vector source register. The non-interleaving variant can be useful for supporting workloads where (in comparison to the interleaving variant discussed below) the storage structure in memory does not pack multiple channels in interleaved fashion.
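The difference between the two element orderings can be shown with a small sketch. Narrowing is omitted here so that only the ordering is illustrated; both orderings are illustrative of the two variants in the description, not definitive encodings:

```python
# Sketch contrasting the non-interleaving ordering (each register's elements
# form one contiguous portion of the stored vector) with an interleaving
# variant (corresponding elements of the registers alternate).

def non_interleaved(regs):
    """Elements of each register form one contiguous portion of the result."""
    out = []
    for reg in regs:
        out.extend(reg)
    return out

def interleaved(regs):
    """Corresponding elements of the registers alternate in the result."""
    out = []
    for group in zip(*regs):
        out.extend(group)
    return out

z1 = ["a0", "a1", "a2", "a3"]
z2 = ["b0", "b1", "b2", "b3"]
print(non_interleaved([z1, z2]))  # → a0 a1 a2 a3 b0 b1 b2 b3
print(interleaved([z1, z2]))      # → a0 b0 a1 b1 a2 b2 a3 b3
```

The interleaved ordering suits memory structures that pack multiple channels element-by-element (e.g. interleaved pixel channels), while the non-interleaved ordering suits planar layouts.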
The non-interleaving variant of the narrowing vector store instruction may specify at least one predicate register, each predicate register specifying element predication information indicative of which data elements of the plurality of vector source registers are masked data elements for which corresponding portions of the at least one vector of narrowed data elements are to specify a value independent of the masked data elements. For example, for a portion of the location in the memory system which would otherwise be written with a narrowed data element corresponding to a given masked data element if predication had not been applied, the value independent of the masked data element (e.g., zero) could be written instead.
Alternatively, the portions of the memory system which correspond to masked data elements of a vector source register may be excluded from the store operation, and the store operation may not store any data to those portions of the memory system. The predication information could be represented in different ways. Some examples may specify, as the at least one predicate value, a predicate mask comprising a number of bit fields each indicating whether a corresponding set of one or more elements is masked or non-masked. Alternatively, a predicate counter could be used to indicate a total number of non-masked data elements (which might implicitly be considered to start from the first data element of the vector of narrowed data elements to be stored to the memory system). In other words, the predicate counter could indicate the boundary between an initial set of non-masked data elements and a subsequent set of masked data elements. Regardless of the specific way in which the predicate value represents the masked data elements, by supporting predication, this allows software to prevent the store operations for the narrowing vector store instruction spilling into a subsequent region of memory beyond the end of the data structure being processed, even if the data structure being processed has a total number of data elements that does not correspond to an exact multiple of the number of elements that can be processed using a single instance of the narrowing vector store instruction.
For the non-interleaving variant of the narrowing vector store instruction, the element predication information may be specified by the at least one predicate register at a granularity of the narrowed data elements to be stored to memory. Typically, element predication information for a store instruction may be sized to correspond to a given vector length so that a single predicate register corresponds to a single vector source register. However, because in response to the narrowing vector store instruction the elements are narrowed before being stored, the provided amount of element predication information for a given portion of memory at the narrowed data size can actually be applied to a larger amount of data in the source registers. This means that a given predicate register having a given granularity can correspond to more than one vector source register. Hence, a given predicate register may specify element predication information corresponding to two or more of the plurality of vector source registers of the narrowing vector store instruction. This can enable a reduction in the number of predicate registers required for a given operation.
In some examples, in response to the narrowing vector store instruction specifying N vector source registers, the instruction decoding circuitry may control the issue circuitry to issue the at least one micro-operation to control the processing circuitry to narrow the data elements to the first data element size being no larger than 1/N times the second data element size. For example, if two source vector registers were specified then the narrowed data elements may be no larger than half the size of the data elements of the source vector registers. This ensures that the vector of narrowed data elements is no larger than the given vector length. This means that the amount of data stored to memory in response to the narrowing vector store instruction is no more than the amount of data that would be stored to memory in response to a conventional store acting on one source register having the given vector length. In this way, control information, which would otherwise be used to control a store operation having the size of the given vector length, does not need to be expanded to accommodate the narrowing vector store instruction. For example, a predicate or other control value may have a size which corresponds to the given vector length, so by restricting the vector of narrowed data elements to the given vector length, then this control information does not need to be expanded and the narrowing vector store instruction is not associated with additional overhead.
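Under the assumed sizes below (chosen purely for illustration), this constraint can be checked arithmetically: with N source registers, narrowing each element to at most 1/N of its original size keeps the total stored data within one vector length.

```python
def stored_bytes(vl_bytes, n_regs, src_elem_bytes, dst_elem_bytes):
    # Total bytes written: every element of every source register,
    # each stored at the narrowed element size.
    elems_per_reg = vl_bytes // src_elem_bytes
    return n_regs * elems_per_reg * dst_elem_bytes
```

For a 16-byte vector length, two source registers with 2x narrowing (or four with 4x narrowing) store exactly one vector length of data.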
Some examples may support an interleaving variant of the narrowing vector store instruction. In response to the interleaving variant of the narrowing vector store instruction, the instruction decoding circuitry controls the issue circuitry to issue the at least one micro-operation to control the processing circuitry to store to the memory system at least one interleaved vector of narrowed data elements comprising a plurality of interleaved channels of narrowed data elements, each of the interleaved channels corresponding to one of the source vector registers. The interleaving variant can be useful for supporting processing workloads which operate on a storage structure comprising multiple interleaved channels of data elements, such as complex numbers represented as separate real and imaginary parts in two channels, or pixel data for an image represented in RGB (red, green, blue) or RGBA (red, green, blue, alpha) format, but which require processing kernels to be applied to individual channels of data extracted from the stored multi-channel structure. Hence, whilst data may be stored in vector registers in separate channels for processing, it may be desired to store the data back to memory in an interleaved pattern so that channels corresponding to a particular element group (e.g., a particular pixel) are stored together in memory.
By combining the interleaving operation with the store operation, then the number of micro-operations may be reduced (by removing a micro-operation for retrieving the interleaved elements from an architectural register for the subsequent store operation), and register pressure can be reduced as the interleaved elements do not need to be temporarily assigned to an architectural register at an intermediate step between the interleaving and the store.
An interleaving pattern for storing the vector of narrowed data elements to the memory location can be implicitly defined by an encoding of the interleaving variant of the narrowing vector store instruction. Hence, there is no need for the interleaving variant of the narrowing vector store instruction to specify an index operand which specifies index values explicitly identifying which of the source data elements is to be inserted at each respective element position of the vector of narrowed data elements to be stored to the location in memory. For example the implicitly defined interleaving pattern may be a pattern in which narrowed data elements from respective vector source registers are selected in turn to provide the next element of the vector of narrowed data elements (e.g. if there are separate source registers for R, G, B, and A channels in an image processing example, then elements may be selected from each register in turn to provide an interleaved vector of narrowed data elements RGBARGBARGBA...). Hence, the interleaving variant of the narrowing vector store instruction may be architecturally constrained to control the processing circuitry to perform a particular implicitly-defined interleaving pattern, and cannot support any arbitrary general-purpose permutation where any single data element from the source vectors can be arbitrarily permuted to any position within the vector of narrowed data elements.
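A minimal sketch of this implicitly defined round-robin pattern (the function name and channel values are illustrative assumptions): one narrowed element is taken from each source register in turn, with no explicit index operand.

```python
def interleave_narrow(sources, dst_bits):
    # Take one element from each channel register in turn (R, G, B, R, G, B, ...).
    out = []
    for group in zip(*sources):                      # element group: one per channel
        for e in group:
            out.append(e & ((1 << dst_bits) - 1))    # narrow by truncation
    return out
```

With three channel registers, element group 0 of every channel is emitted before element group 1, giving the packed interleaved layout described above.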
Using an implicitly defined interleaving pattern can be advantageous because an index vector used to control a general-purpose permutation can make it difficult to write vector-length-agnostic software according to a scalable vector architecture (discussed further below), as the index values for the index vector would need pre-computing in software for a specific vector length, and the limited range available for each index value of the index vector can constrain the maximum vector length that can be supported by a generic permutation operation controlled by the index vector. By providing an interleaving variant of the narrowing vector store instruction which uses an implicitly-defined interleaving pattern, these problems can be avoided, and scalable vectorised software becomes possible for handling workloads which operate on narrower data elements stored in memory as a packed interleaved data structure.
The interleaving variant of the narrowing vector store instruction may also specify at least one predicate value specifying element predication information indicative of which data elements of the plurality of vector source registers are masked data elements for which corresponding portions of the at least one vector of narrowed data elements are to specify a value independent of the masked data elements. However, unlike the non-interleaving variant, for the interleaving variant it can be useful for the element predication information to be specified at a granularity of element groups, each element group comprising one data element from each of the plurality of interleaved channels. For example, all of the elements within the same element group may have their predication controlled by the same mask bit of a predicate mask specified as the at least one predicate value. Alternatively, in the case of a predicate counter, monotonically increasing values of the predicate counter cause successive groups of elements to be selected as active (e.g. a count value of 3 causes a further element group (comprising multiple data elements) to be selected as active which would have been masked if the count value was 2). Hence, for the interleaving variant, the encoding of the instruction constrains the predication to either mask all data elements of the same element group or not mask any of the data elements of the same element group, so it would not be possible to partially mask some elements of the element group while not masking other elements in the same element group. Applying predication at the granularity of element groups can be particularly useful for the interleaving variant as typically if there is a need to interleave multiple channels of interleaved data, each of the channels of a given element group will either need to be processed or not, and so there is little benefit to supporting partial predication.
Restricting predication to the granularity of element groups can also simplify the micro-architectural implementation because when interleaving the elements of a single element group, a common predicate control can be applied to the corresponding element position in each vector source register.
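This group-granularity predication could be sketched as follows (an assumed representation, in which one predicate bit per element group is expanded to cover every channel of that group):

```python
def expand_group_mask(group_mask, n_channels):
    # One bit governs the whole element group, so each bit is repeated
    # once per interleaved channel; partial masking of a group is impossible.
    return [bit for bit in group_mask for _ in range(n_channels)]
```

The same expanded bit is then applied at the corresponding element position of each vector source register, which is the common predicate control mentioned above.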
The number of vector source registers specified by the narrowing vector store instruction can vary. Some examples may support a two-register variant of the narrowing vector store instruction. For example, in the interleaving variant this can be useful to support interleaving of real and imaginary parts of a complex value. Other examples may support three-register and/or four-register versions of the instruction, or could support further vector source registers. An interleaving variant with three vector source registers could be useful for supporting operations on image data in RGB format. An interleaving variant with four vector source registers could be useful for supporting operations on image data in RGBA format. On the other hand, regardless of whether the variant is interleaving or non-interleaving, supporting more than two vector source registers can help to increase the memory throughput, allowing fewer vectorised loop iterations to be processed for a given amount of data. Where more than one variant of the instruction is supported corresponding to different numbers of vector source registers, the respective variants of the instruction can be distinguished by their instruction opcodes or by another field of the instruction identifying the number of vector source registers.
The processing circuitry is to narrow data elements of the plurality of vector source registers to a first data element size from a second data element size. The way this is performed is not particularly limited. For example, a rounding process can be performed when reducing the number of bits to the first data element size. In some examples, if the first data element size is too small to represent the value of the data element, then the narrowed data element may be a saturated value taking the largest value representable in a data element of the first data element size.
However, in workloads where data is stored in memory at a narrower size and is operated on in registers at a larger size, there may often be no requirement to account for rounding. In these workloads the values to be stored to memory may be anticipated to be no larger than can be represented using the first data element size, as the increase in data element size was for increased precision rather than for representing values larger than could be represented in the smaller data element size. For example, data may be stored in memory as an integer having the first data element size, loaded and expanded to the second data element size, converted to a more precise representation (e.g., floating point from integer), operated on, converted back to the less precise representation (rounding may be performed at this stage), and then specified in the narrowing vector store instruction. Therefore, the data to be narrowed and stored may be expected to not be significantly larger than the elements which were initially loaded from memory having the first data element size. Hence, in some examples the processing circuitry is configured to narrow a given data element by selecting as the narrowed data element a portion of lowest order bits of the given data element, the portion having a size equal to the first data element size. Narrowing by truncation in this way is simple to perform, and can avoid incurring overhead for unnecessary rounding in cases where rounding is not anticipated to be necessary.
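The two narrowing behaviours mentioned above, truncation to the lowest-order bits and (for unsigned values) saturation, can be sketched as:

```python
def narrow_truncate(value, dst_bits):
    # Select the lowest-order dst_bits of the element.
    return value & ((1 << dst_bits) - 1)

def narrow_saturate_unsigned(value, dst_bits):
    # Clamp to the largest value representable at the narrowed size.
    return min(value, (1 << dst_bits) - 1)
```

Truncation simply discards the upper bits, which is cheap and correct when the value is already known to fit; saturation instead pins an out-of-range value to the maximum representable value.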
In some examples, the narrowing vector store instruction may permit the stored at least one vector of narrowed data elements to have a total size greater than or equal to twice the given vector length. For example, this could be useful for an implementation which stores RGBA data from four vector source registers with a 2x narrowing operation, so that the result is that two vectors' worth of data (i.e. data of a size twice the given vector length) is stored to memory. By enabling more than one vector length of data to be stored in a single instruction, memory throughput can be increased.
Some implementations may support a fixed definition for the first data element size and the second data element size (e.g. with a fixed ratio between the first data element size and second data element size, e.g. with the second data element size being twice or four times the first data element size).
However, in some examples, at least one of the first data element size and the second data element size is variable depending on at least one control parameter associated with the narrowing vector store instruction. For example, this control parameter could comprise one or more of: the opcode of the instruction; a field of the instruction encoding for specifying element size information; and/or a parameter stored in a control register for specifying element size information. For example, variants of the instruction corresponding to different settings of the control parameter could be provided for supporting two-times and/or four-times widening operations, and for operating on different sized elements in memory. For example, variants could be provided for cases where the first data element size and second data element size are 8 and 16 bits respectively, 8 and 32 bits respectively and/or 16 and 32 bits respectively, say.
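One hypothetical encoding of such a control parameter (the field values and the mapping below are invented for illustration; they match the example size pairs mentioned above but are not taken from any real encoding):

```python
# Hypothetical two-bit size field selecting (second, first) element sizes in bits.
SIZE_VARIANTS = {
    0b00: (16, 8),    # 2x narrowing: 16-bit register elements to 8-bit memory elements
    0b01: (32, 8),    # 4x narrowing
    0b10: (32, 16),   # 2x narrowing: 32-bit register elements to 16-bit memory elements
}

def element_sizes(size_field):
    second, first = SIZE_VARIANTS[size_field]
    return second, first
```

The same information could equally be folded into distinct opcodes or held in a control register, as the passage above notes.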
In some implementations, the given vector length may be implicitly defined as a fixed value in the instruction set architecture. In other examples, the given vector length may be variable, based on a parameter associated with the narrowing vector store instruction.
However, it can be useful for the apparatus to comprise vector length storage circuitry to store a vector length parameter indicative of the given vector length. This can help to support vector-length-agnostic software written according to a scalable vector instruction set architecture, where the given vector length used for execution of the narrowing vector store instruction is unknown at compile time, so that the same software could execute on different processing platforms (implementing different vector lengths as the given vector length), but instructions such as loop control instructions may adapt their operation to the implemented vector length based on the given vector length read from the vector length storage circuitry.
In some examples, the apparatus comprising the instruction decoding circuitry and issue circuitry may also comprise the processing circuitry controlled by the issue circuitry to perform the operation corresponding to the narrowing vector store instruction.
However, in other examples, for at least a subset of types of instruction, the issue circuitry may issue the at least one micro-operation corresponding to the narrowing vector store instruction to off-chip processing circuitry on a separate integrated circuit to the issue circuitry. Hence, it is not essential that the processing circuitry which performs the operation in response to the narrowing vector store instruction is part of the same apparatus as the instruction decoding circuitry and issue circuitry which decodes the narrowing vector store instruction and issues one or more corresponding micro-operations. For example, the off-chip processing circuitry may comprise a co-processor to which the main processor comprising the instruction decoding circuitry and issue circuitry may offload operations, where the co-processor could be implemented on a separate chiplet to the instruction decoding/issuing circuit logic of the main processor.
In some examples, the apparatus may comprise a co-processor configured to perform processing operations for a subset of instruction types offloaded by a main processor, and the co-processor may comprise the instruction decoding circuitry for decoding instructions of said subset of instruction types and the issue circuitry. The co-processor could be located on the same chip as the main processor or on a different chip compared to the main processor. For the subset of instruction types offloaded to the co-processor, that co-processor may have its own internal instruction decoding circuitry and issue circuitry, and so the techniques discussed above could also be implemented within the co-processor.
Hence, there are a variety of use cases where instruction decoding circuitry and issue circuitry may be provided supporting the narrowing vector store instruction.
The techniques discussed above may be implemented within an apparatus which has hardware circuitry provided for implementing the instruction decoding circuitry, issue circuitry (and if provided in the same apparatus, processing circuitry) as discussed above. However, the same technique can also be implemented within a computer program which executes on a host data processing apparatus to provide an instruction execution environment for execution of target code. Such a computer program may control the host data processing apparatus to simulate the architectural environment which would be provided on a hardware apparatus which actually supports target code according to a given instruction set architecture, even if the host data processing apparatus itself does not support that architecture.
The computer program may have instruction decoding program logic which emulates functions of the instruction decoding circuitry discussed above, and processing program logic which performs a processing operation corresponding to a given instruction decoded by the instruction decoding program logic. For example, the instruction decoding program logic may comprise if/then statements to control selection, in response to a given instruction of the target code, of a corresponding sequence of code (part of the processing program logic) written in the native instruction set of the host data processing apparatus, where execution of that sequence of code would control the host data processing apparatus to perform the operations corresponding to the decoded instruction. The instruction decoding program logic and processing program logic support a narrowing vector store instruction as discussed above, which causes a corresponding narrowing store operation (with optional interleaving) to be performed by the processing program logic. However, registers and memory address space expected to be provided in the target program code's instruction set architecture may not actually be provided in the host apparatus. Therefore, such registers and memory address space may be simulated by mapping them onto the host's storage circuitry (e.g. registers and memory of the host apparatus). Hence, in the simulation embodiment, the store target address of the narrowing vector store instruction (computed based on the at least one address operand) represents an address in a simulated address space (which may differ from the address space used to access host memory of the host data processing apparatus) and the vector source registers may be emulated using corresponding regions of the host storage circuitry rather than being mapped to any specific hardware register file.
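A highly simplified sketch of such instruction decoding program logic (all mnemonics, register names and the dict-based instruction format are invented for illustration): an if/then dispatch selects handler code, while the simulated registers and simulated address space are mapped onto host data structures.

```python
def simulate(insn, scalar_regs, vector_regs, sim_memory):
    # sim_memory stands in for the simulated address space, held in host storage.
    if insn["op"] == "narrow_store":                  # hypothetical mnemonic
        addr = scalar_regs[insn["base"]]              # store target address
        dst = insn["dst_bytes"]
        elems = [e for r in insn["srcs"] for e in vector_regs[r]]
        for i, e in enumerate(elems):
            lo = e & ((1 << (8 * dst)) - 1)           # narrow by truncation
            sim_memory[addr + i * dst:addr + (i + 1) * dst] = lo.to_bytes(dst, "little")
    else:
        raise NotImplementedError(insn["op"])
```

Here the dicts of scalar and vector registers and the bytearray of simulated memory play the role of the host storage circuitry onto which the target architecture's state is mapped.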
Such a simulation computer program can be useful, for example, when legacy code written for one instruction set architecture is being executed on a host processor which supports a different instruction set architecture. Also, the simulation can allow software development for a newer version of the instruction set architecture to start before processing hardware supporting that new architecture version is ready, as the execution of the software on the simulated execution environment can enable testing of the software in parallel with ongoing development of the hardware devices supporting the new architecture. The simulation program may be stored on a storage medium, which may be a non-transitory storage medium.
Particular examples will now be described with reference to the Figures.
Figure 1 schematically illustrates an example of a data processing apparatus 2. The data processing apparatus has a processing pipeline 4 which includes a number of pipeline stages.
In this example, the pipeline stages include a fetch stage 6 for fetching instructions from an instruction cache 8; a decode stage 10 (an example of instruction decoding circuitry) for decoding the fetched program instructions to generate micro-operations (decoded instructions) to be processed by remaining stages of the pipeline; an issue stage 12 (an example of issuing circuitry) for checking whether operands required for the micro-operations are available in registers 14 and issuing micro-operations for execution once the required operands for a given micro-operation are available; an execute stage 16 (an example of processing circuitry) for executing data processing operations corresponding to the micro-operations, by processing operands read from the registers 14 to generate result values; and a writeback stage 18 for writing the results of the processing back to the registers 14. It will be appreciated that this is merely one example of a possible pipeline arrangement, and other systems may have additional stages or a different configuration of stages. For example, in an out-of-order processor a register renaming stage could be included for mapping architectural registers specified by program instructions or micro-operations to physical register specifiers identifying physical registers in the registers 14. In some examples, there may be a one-to-one relationship between program instructions decoded by the decode stage 10 and the corresponding micro-operations processed by the execute stage. It is also possible for there to be a one-to-many or many-to-one relationship between program instructions and micro-operations, so that, for example, a single program instruction may be split into two or more micro-operations, or two or more program instructions may be fused to be processed as a single micro-operation.
The execute stage 16 includes a number of processing units, for executing different classes of processing operation. For example the execution units may include a scalar processing unit 20 (e.g. comprising a scalar arithmetic/logic unit (ALU) 20 for performing arithmetic or logical operations on scalar operands read from the registers 14); a vector processing unit 22 for performing vector operations on vectors comprising multiple vector elements; and a load/store unit 28 for performing load/store operations to access data in a memory system 8, 30, 32, 34. Other examples of processing units which could be provided at the execute stage could include a floating-point unit for performing operations involving values represented in floating-point format, or a branch unit for processing branch instructions.
The registers 14 include scalar registers 25 for storing scalar values, vector registers 26 for storing vector values, and predicate registers 27 for storing predicate values. The predicate values 27 may be used by the vector processing unit 22 when processing vector instructions, with a predicate value in a given predicate register indicating which vector elements of a corresponding vector operand stored in the vector registers 26 are active (non-masked) vector elements or inactive (masked) vector elements (where operations corresponding to inactive data elements may be suppressed or may not affect a result value generated by the vector processing unit 22 in response to a vector instruction).
A memory management unit (MMU) 36 controls address translations between virtual addresses (specified by instruction fetches from the fetch circuitry 6 or load/store requests from the load/store unit 28) and physical addresses identifying locations in the memory system, based on address mappings defined in a page table structure stored in the memory system. The page table structure may also define memory attributes which may specify access permissions for accessing the corresponding pages of the address space, e.g. specifying whether regions of the address space are read only or readable/writable, specifying which privilege levels are allowed to access the region, and/or specifying other properties which govern how the corresponding region of the address space can be accessed. Entries from the page table structure may be cached in a translation lookaside buffer (TLB) 38 which is a cache maintained by the MMU 36 for caching page table entries or other information for speeding up access to page table entries from the page table structure shown in memory.
In this example, the memory system includes a level one data cache 30, the level one instruction cache 8, a shared level two cache 32 and main system memory 34. It will be appreciated that this is just one example of a possible memory hierarchy and other arrangements of caches can be provided. The specific types of processing unit 20 to 28 shown in the execute stage 16 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit so that multiple micro-operations of the same type can be handled in parallel. It will be appreciated that Figure 1 is merely a simplified representation of some components of a possible processor pipeline arrangement, and the processor may include many other elements not illustrated for conciseness.
Figure 2 illustrates sets of architectural registers available for referencing by instructions encoded according to the instruction set architecture (ISA) supported by the processing system 2. Figure 2 does not show the physical register files provided in hardware for implementing these architectural registers, but merely shows which registers are logically available for referencing by instructions. The particular mapping of the architectural registers logically referenced by instructions onto physical storage within the physical register file(s) 14 can be implemented in many different ways (e.g. with a one-to-one mapping between architectural registers and physical registers in an in-order processor or a variable mapping between architectural registers and physical registers controlled by register renaming in an out-of-order processor).
In this example, the ISA supports:
* a general purpose register set 25 comprising general purpose scalar registers for specifying scalar operands for scalar processing operations;
* a vector register set 26 comprising vector registers for specifying vector operands for vector processing operations, matrix operations or other SIMD operations;
* a predicate register set 27 comprising predicate registers for specifying predicate values for predicating vector, matrix or other SIMD operations; and
* a set of control registers 40 for storing control values for controlling operation of the processing apparatus 2.
Information stored in the control registers may be set automatically in response to certain events, or can be programmable based on execution of a system register updating instruction.
In this example, the ISA supported by the processing apparatus 2 is a scalable vector ISA (also known as a "vector length agnostic" vector ISA) supporting vector instructions operating on vectors of scalable vector length to enable the same instruction sequence to be executed on apparatuses with hardware supporting different maximum vector lengths. This allows different hardware designers of processor implementations to choose different maximum vector lengths depending on whether their design priority is high-performance or reduced circuit area and power consumption, while software developers need not tailor their software to a particular hardware platform as the software written according to the scalable vector ISA can be executed across any hardware platform supporting the scalable vector ISA, regardless of the particular maximum vector length supported by a particular hardware platform. Hence, the vector length to be used for vector registers 26 accessed by a particular vector instruction of the scalable vector ISA (and hence also the predicate length of the corresponding predicate registers 27) is unknown at compile time (neither defined to be fixed in the ISA itself, nor specified by a parameter of the software itself). The operations performed in response to a given vector instruction of the scalable vector ISA may differ depending on the vector length chosen for a particular hardware implementation (e.g. hardware supporting a greater maximum vector length may process a greater number of vector elements for a given vector instruction than hardware supporting a smaller maximum vector length). An implementation with a shorter vector length may therefore require a greater number of loop iterations to carry out a particular function than an implementation with a longer vector length.
The vector length agnostic property of the scalable vector ISA is useful because within a fixed encoding space available for encoding instructions of the ISA, it is not feasible to create different instructions for every different vector length that may be demanded by processor designers, when considering the wide range of requirements scaling from relatively small energy-efficient microcontrollers to servers and other high-performance-computing systems. By not having a fixed vector length known at compile time, multiple markets can be addressed using the same ISA, without effort from software developers in tailoring code to each performance/power/area point.
To achieve the scalable property of the scalable vector ISA, the functionality of the vector instructions of the scalable vector ISA is defined in the architecture with reference to a parameter (e.g. VL 43 or SVL 42 as shown in Figure 2, described in more detail below) which indicates the vector length in use (when considering the maximum vector length supported in hardware and any software-defined limitations using the control registers 40), where that parameter VL or SVL is unknown at compile time. Hence, execution of the same vector instruction on different systems may produce different results (typically varying in terms of the number of vector elements generated, a subset of which may have the same result values on different platforms, but in general platforms implementing a greater vector length may generate additional vector elements in comparison with a platform implementing a smaller vector length).
Predicate values defined in the predicate registers 27 may be used to control which elements are generated in a given instance of an instruction and can be set based on vector length agnostic principles such as by using comparison instructions to automatically generate the values of predicate for a particular loop iteration or applying some generally-defined predicate pattern which can scale to different vector lengths. Certain instructions may update loop control parameters such as an element count value to track how many vector elements have been processed so far, so that across all iterations of a loop as a whole both implementations with wider and narrow vector lengths may eventually achieve the same results but with different levels of performance, since the implementation with a wider vector length may require fewer loop iterations than an implementation with a narrower vector length.
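A sketch of this style of loop control (modelled loosely on a while-less-than comparison; the function name is an illustrative assumption): the predicate for each iteration is generated from the running element count, so the same loop body works unchanged for any vector length.

```python
def while_lt(count, total, vl_elems):
    # Element i is active while (count + i) < total, so the final loop
    # iteration gets a partially-true predicate instead of overrunning
    # the end of the data structure.
    return [count + i < total for i in range(vl_elems)]
```

Hardware with a wider vector length simply consumes more elements per iteration and therefore needs fewer iterations for the same result.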
This particular example of an ISA also supports two different modes for executing vector operations: a non-streaming mode of operation and a streaming mode of operation. Mode indicating state information 41 stored in the control registers 40 indicates whether the current mode is the non-streaming mode or streaming mode, and can be set in response to execution of a mode changing instruction. Scalar operations using the general purpose registers 25 may be processed in the same way regardless of whether the current mode is the non-streaming mode or the streaming mode, but operations using the vector registers 26 and predicate registers 27 may be processed differently depending on whether the current mode is the streaming mode or the non-streaming mode.
In the non-streaming mode, vector registers 26 are architecturally designated as having a vector register length VL identified by a non-streaming vector length specifying value 43 specified in the control registers 40, and the predicate registers 27 are architecturally designated as having a register length VL/X, where X is a constant corresponding to a minimum vector element size supported (e.g. X may equal 8 for an implementation where the smallest vector element size is 8 bits). In the streaming mode, vector registers 26 are architecturally designated as having a streaming mode vector length SVL identified by a streaming vector length specifying value 42 specified in the control registers 40 (the streaming vector length specifying value 42 being separate from the non-streaming vector length specifying value 43), and the predicate registers 27 are architecturally designated as having a register length SVL/X. Hence, both the vector registers 26 and predicate registers 27 may logically be seen as changing register length when there is a change of mode between the streaming mode and the non-streaming mode.
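The relationship between the vector register length and the predicate register length described above can be sketched as follows (a simplified illustration only; the function and constant names are not part of the ISA):

```python
MIN_ELEMENT_BITS = 8  # X: the smallest supported vector element size, in bits

def predicate_length_bits(vector_length_bits, x=MIN_ELEMENT_BITS):
    # One predicate bit per smallest-size element that fits in a vector,
    # giving a predicate register length of VL/X (or SVL/X in streaming mode)
    return vector_length_bits // x

# Using the example lengths mentioned later in this description:
print(predicate_length_bits(128))  # 16 (non-streaming mode, VL = 128 bits)
print(predicate_length_bits(512))  # 64 (streaming mode, SVL = 512 bits)
```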
Both the non-streaming vector length specifying value 43 and streaming mode vector length specifying value 42 may be implemented in different ways. In some examples, these vector length specifying values 43, 42 could simply be a hardwired piece of state information which is not programmable by software, and simply indicates the maximum register length supported for each mode by the hardware. This can then be read by software to identify the particular vector length implemented on the hardware executing the program, so that the same software can execute with different vector lengths on different hardware.
In other examples, the ISA may support more privileged software being able to limit the maximum vector length which is usable by software executing in a less privileged state. For example, to save power a given piece of software could be limited so that it cannot make use of the full vector length supported in hardware. Hence, the vector length specifying values 43, 42 could include information settable by software, to specify the vector length to be used in each mode. Nevertheless, even if the more privileged software applies a limit on vector length, the vector length for the application software is still unknown at compile time because it will not be known whether the actual implemented vector length in a particular processor will be greater or less than the limit defined in the length specifying value 43, 42. For implementations with hardware supporting a smaller maximum vector length than the limit defined in the length specifying value 43, 42, a smaller vector length than indicated by the limit will actually be used. For example, the effective vector length seen by software may correspond to the minimum of the maximum vector length supported in hardware for the current mode and the vector length limit set by software. The vector length specifying values 43, 42 may be banked per exception level so that different limits on maximum vector length supported may be specified for software executing in different exception levels (e.g. software at one exception level may be allowed to use a longer vector length than software at another exception level).
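The selection of the effective vector length described above (the minimum of the hardware-supported maximum and any software-specified limit) can be sketched as follows (names are illustrative, not architectural):

```python
def effective_vector_length(hw_max_vl, sw_limit_vl=None):
    """Effective vector length seen by software: the maximum supported
    in hardware for the current mode, optionally capped by a limit set
    by more privileged software."""
    if sw_limit_vl is None:
        return hw_max_vl
    return min(hw_max_vl, sw_limit_vl)

# Hardware supports 512-bit vectors; privileged software caps this at 256 bits
print(effective_vector_length(512, 256))  # 256
# Hardware supports only 128 bits, less than the limit: hardware length is used
print(effective_vector_length(128, 256))  # 128
```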
Hence, there can be a variety of ways in which control state information stored in the control registers 40 may influence the vector length used for vector operations, but in general some state information is available which can enable software to determine the effective vector length used for each mode. Hence, a given vector length is defined for the vectors associated with a given vector instruction to be executed.
It can be useful to support both the non-streaming and streaming modes, as this can provide greater flexibility for hardware microarchitecture designers to support different hardware implementations as shown in Figures 3 and 4 for example. In the example of Figure 3, the vector instructions are executed by processing circuitry 16 of a main processor 2, without any operations being offloaded to a coprocessor. On the other hand, in the example of Figure 4, a coprocessor 50 is provided for executing a particular subset of vector operations using the vector registers 26 and predicate registers 27. The coprocessor 50 may comprise coprocessor decode circuitry 52 for decoding instructions issued to the coprocessor 50 by issue circuitry 12 of the main processor 2, and coprocessor issue circuitry 54 for receiving the decoded micro-operations from the coprocessor decode circuitry 52 and determining when operands for those instructions will be available and issuing the micro-operations to coprocessor processing circuitry 56 when the operands are available. The coprocessor 50 may have its own register storage 58 separate from the register storage 14 of the main processor 2. The coprocessor 50 may have access to the memory system shared with the main processor 2 and so can execute vector load/store instructions to load/store data from/to memory (for example, the coprocessor 50 may have access to one or more of the main processor's data caches 30, 32 and may access main memory 34). The coprocessor processing circuitry 56 executes the load/store and computation operations represented by the instructions offloaded to the coprocessor 50 by the main processor 2, with reference to operands stored in the coprocessor register storage 58 and data accessed from the memory system.
The architecturally defined streaming mode of processing and separate vector length indicating values 42, 43 for the respective modes makes it simpler for the hardware to determine when instructions requiring vector registers should be offloaded to the coprocessor 50 or executed within the local execution units 16 of the main processor 2. It also allows software to explicitly designate whether a particular workload would be more suited for execution on the general purpose execution units 16 of the processor 2 or on the more bespoke hardware of the coprocessor 50. This can be useful because for vector processing routines requiring smaller vectors and/or workloads where vector operations are interspersed with scalar operations, it may be more appropriate for the vector operations to be processed on the general purpose execution units 16 local to the processor 2 itself, while the coprocessor 50 may be more suited to processing "streaming" workloads which require high throughput of vector operations on large datasets with relatively little need for intervening scalar operations (e.g. workloads associated with machine learning applications such as neural network processing).
For such streaming workloads, longer vector lengths may be useful to reduce the instruction fetch/decode overhead associated with processing a given number of vector elements. Hence, although the ISA does not require it (the vector length for non-streaming mode may be selected from among a certain set of vector lengths supported, and the streaming mode vector length may be selected from among a second set of vector lengths supported, with no fixed relation between the length selected for non-streaming mode and streaming mode), in implementations which choose to provide a coprocessor 50 for supporting the streaming vector mode, it is relatively likely that the streaming mode vector length may be greater than the non-streaming mode vector length, in some cases many times greater. As just one example (other lengths can also be used), an implementation might choose to implement a maximum vector length of 128 bits in the non-streaming mode and 512 bits in the streaming mode, with the predicate registers therefore having an architectural vector length of 16 bits in the non-streaming mode and 64 bits in the streaming mode.
In the example using a coprocessor 50 shown in Figure 4, the coprocessor 50 could be either on the same chip (integrated circuit) as the main processor 2, or on a separate chip. For example, the main processor 2 and coprocessor 50 may be implemented as separate chiplets on an interposer, each chiplet being manufactured as a separate component and then assembled on the interposer.
Hence, when considering instruction decoding circuitry, issue circuitry and processing circuitry for implementing the narrowing vector store instruction mentioned in this application, a number of different implementations are possible, including:
* a first example where the instruction decoding circuitry 10, issue circuitry 12 and processing circuitry 16 are all provided within a processor 2 as shown in Figure 3.
* a second example where the instruction decoding circuitry 10 and issue circuitry 12 are in the main processor 2, but the processing circuitry which executes the narrowing vector store instruction is (at least in some operating modes, such as the streaming mode described above) the coprocessor processing circuitry 56 in the coprocessor 50, which could be either on the same integrated circuit as the main processor 2 or could be off-chip processing circuitry 56 on a separate integrated circuit from the main processor 2. In other operating modes (e.g. the non-streaming mode), the processing circuitry which executes the narrowing vector store instruction could be the processing circuitry 16 of the main processor 2 as in the first example.
* a third example as shown in Figure 4 where the responsibility for decoding the narrowing vector store instruction into one or more micro-operations lies with the coprocessor decode circuitry 52, and so the apparatus handling the narrowing vector store instruction can be considered to be the coprocessor 50 which may or may not be on the same chip as the main processor 2. In this case, the instruction decoding circuitry, issue circuitry and processing circuitry may be the coprocessor decode circuitry 52, coprocessor issue circuitry 54 and coprocessor processing circuitry 56 respectively.
As it is possible that the processing circuitry which actually executes the operations for the narrowing vector store instruction could be on a different chip to the instruction decoding circuitry 10, 52 and issue circuitry 12, 54 that decodes and issues the narrowing vector store instruction, it is not essential for the processing circuitry 16, 56 itself to be in the same apparatus as the instruction decoding circuitry 10, 52 and issue circuitry 12, 54.
While the example of Figure 2 discussed a scalable vector ISA, the narrowing vector store instruction described in this patent application can also be applied to a non-scalable vector ISA for which the vector length is known at compile time (either being fixed in the architecture, or being variable based on a software-specified parameter). Also, while a scalable vector ISA supporting separate non-streaming and streaming modes is described above, the narrowing vector store instruction could also be provided in a scalable vector ISA not supporting the streaming mode, so that mode indication 41 and streaming vector length parameters 42 are not provided, and the given vector length to be used for vector instructions is defined by VL parameter 43.
Figure 5 shows an example of both non-interleaving and interleaving variants of a narrowing vector store instruction.
It can be relatively common when processing certain workloads, such as image processing workloads, to store data elements in memory at a smaller data element size, but to widen (pad) the data when it is loaded from memory into registers so that it can be processed as a wider data type. For example, 8-bit pixel data stored in memory may be widened to 16 bits per data element, to allow for processing at higher precision. After processing, the data may then be stored back to memory at the narrower data type, and therefore may be narrowed prior to being stored to memory. However, existing vector architectures may be inefficient at processing this type of operation.
One approach may be to use a narrowing store instruction which specifies a single vector source operand. The instruction may narrow elements of the source register and store them to a contiguous region of memory. However, this approach does not allow for interleaving of elements of the source vector register with elements of other source registers. Interleaving may be required in some workloads where a narrowing store could be used, such as image processing, to store data from different channels (which may be loaded into different registers for separate processing) as interleaved channels of data elements for each pixel position (e.g., separate R, G, B, and A channels may be stored in memory as RGBARGBA...). In addition, a narrowing store instruction specifying a single vector source operand would store to memory less than a full vector length worth of data. The single vector source operand would have a size of up to a vector length, so after narrowing the amount stored to memory would be less than a vector length. The store operation may be able to process at least a vector length of data, and therefore by storing less than a vector length of data, available capacity of the store operation may be unused (and the number of elements processed per store instruction may be reduced, thus requiring additional instances of store instructions to be executed to process a given number of elements, increasing the load/store unit and memory translation overhead in processing that number of elements).
To allow for interleaving and storing a full vector length of data to memory, an alternative approach may be to use a zip instruction which takes data from two or more source registers, narrows and combines data elements of the two or more source registers, and stores the result to an architectural destination register. This could be followed by a store instruction to take the narrowed and combined elements from the architectural destination register, and store the result to a memory location. However, a problem with this approach is that an architectural register is assigned as the destination of the zip instruction, which increases register pressure: the software will run out of spare architectural vector register identifiers available for identifying new variables sooner, which can reduce performance due to increased memory operations to spill variables out to memory that could not fit in the architectural register space. A further problem with this approach is that it may be decomposed into a larger number of micro-operations than could be used in a different approach. In particular, the store instruction may be decomposed into two micro-operations, the first of which ("STDATA") retrieves the data from the architectural source register (which is the destination register of the zip instruction) and the second of which ("STADDRESS") stores the retrieved data to memory. However, in the approach discussed below using the narrowing vector store instruction, the first of these micro-operations can be avoided because the narrowed and combined data elements are not stored to an architectural register in an intermediate stage of the operation.
These issues can be addressed by providing, as an instruction supported in the instruction set architecture, a narrowing vector store instruction which specifies at least one address operand for defining a target memory address for a store operation, and specifies multiple vector source registers 26. The instruction can also specify predicate information, for example using a predicate register 27. The function of the instruction is to cause processing circuitry 16, 56 to narrow data elements of the plurality of specified vector source registers to a first data element size from a second data element size, and store a vector of narrowed data elements having a first data element size from the plurality of vector source registers to a location in memory corresponding to the target memory address.
Hence, in the example of Figure 5, the instruction specifies two vector source registers Z1 and Z2, and is used to store to memory a vector of narrowed data elements having a first data element size (e.g., 8 bits) from elements of the two source registers Z1 and Z2 having a second data element size (e.g., 16 bits). Each source register has the given vector length VL, SVL currently in use for the vector processing operations.
For example, the narrowing vector store instruction could specify various operands, such as:
ST1B {<Z1>.H, <Z2>.H}, <P0>, [X17, X16]
Of course, the instruction encoding seen by the instruction decoding circuitry 10, 52 would comprise a binary value encoded to represent the corresponding information. Here, ST1B represents the type of instruction and can correspond to the opcode encoded in the binary encoding. The register identifiers Z1 and Z2 identify the two vector source registers. In some examples, these could all be explicitly identified in the instruction encoding, or alternatively the instruction may be constrained to select as source registers a group of vector registers having adjacent register identifiers, and only one of these register identifiers may be explicitly encoded in the instruction encoding, with others implicitly being at certain offsets relative to the encoded register identifier. The notation .H represents that the second data element size to be used for the un-narrowed data elements is "halfword" size (16 bits) (as opposed to byte size .B (8 bits) or fullword size .W (32 bits)). In this example, it is implicit that the first data element size for the narrowed data elements to be stored in memory is 8 bits, but other examples could include an operand for identifying the first data element size. The predicate register identifier P0 identifies the predicate register providing the predicate value. The register identifiers X17 and X16 represent address operands used to identify the target memory address (e.g. X17 can identify a base register and X16 can identify a register used to specify an offset to be added to the value stored in the base register to generate the target address). It will be appreciated that this is just one example of an addressing mode and any other addressing mode could also be used (e.g. other examples could use an immediate value, program counter value, and/or stack pointer value as one of the address operands).
Hence, when the instruction is decoded by the instruction decoding circuitry 10, 52, the instruction decoding circuitry 10, 52 generates at least one micro-operation which when issued by issue circuitry 12, 54 causes the processing circuitry 16, 56 to narrow data elements of the plurality of vector source registers to a first size from a second size, and store to the location in memory a vector of narrowed data elements. The narrowing could be performed by taking a portion of the lowest-order bits of each data element, the portion having the size equal to the narrowed data element size. For example, as illustrated in Figure 5 element A1 could be narrowed to element A1', which comprises the lowest-order bits of A1 (e.g., in anticipation that the higher-order bits of A1 are zero).
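The truncating form of narrowing described above can be illustrated with a short sketch (illustrative only; a real implementation operates on register lanes in hardware rather than on integer values):

```python
def narrow_element(value, first_size_bits):
    """Narrow one un-narrowed element by keeping only its lowest-order bits,
    e.g. a 16-bit element narrowed to its low 8 bits."""
    return value & ((1 << first_size_bits) - 1)

# A 16-bit element 0x00A1 narrowed to 8 bits keeps the low byte,
# in anticipation that the higher-order bits are zero
print(hex(narrow_element(0x00A1, 8)))  # 0xa1
```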
Both interleaving and non-interleaving variants are possible. With an interleaving variant, the vector of narrowed data elements stored to memory may comprise interleaved elements from each of the vector source registers taken in turn, e.g., A1B1A2B2... This can be particularly useful in examples for processing data in channels. For example, in image processing data an individual pixel may be represented by R, G, and B values stored in memory together as RGBRGB etc. The individual channels may be de-interleaved on loading so they can be processed as separate channels, but then may be interleaved again on storage. Hence, a variant of the narrowing vector store instruction which interleaves elements from the plurality of vector source registers may be particularly useful. In the non-interleaving variant, the vector of narrowed data elements stored to memory may comprise contiguous portions from each vector source register without intervening elements from another vector source register, e.g., A1A2A3A4B1B2B3B4. This variant can be useful in workloads where the contents of each vector source register are to be treated separately both in memory and in the registers.
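The two element layouts described above can be modelled as follows (a behavioural sketch using Python lists to stand in for vector registers; the function names are illustrative):

```python
def narrow(elements, bits=8):
    # Narrow each element by keeping only its lowest-order bits
    mask = (1 << bits) - 1
    return [e & mask for e in elements]

def interleaving_layout(z1, z2, bits=8):
    """Interleaving variant: A1 B1 A2 B2 ... - one narrowed element
    from each source register taken in turn."""
    out = []
    for a, b in zip(narrow(z1, bits), narrow(z2, bits)):
        out.extend([a, b])
    return out

def non_interleaving_layout(z1, z2, bits=8):
    """Non-interleaving variant: A1..An B1..Bn - a contiguous run of
    narrowed elements from each source register in turn."""
    return narrow(z1, bits) + narrow(z2, bits)

z1 = [0x0001, 0x0002, 0x0003, 0x0004]  # "A" channel, 16-bit elements
z2 = [0x0011, 0x0012, 0x0013, 0x0014]  # "B" channel, 16-bit elements
print(interleaving_layout(z1, z2))      # [1, 17, 2, 18, 3, 19, 4, 20]
print(non_interleaving_layout(z1, z2))  # [1, 2, 3, 4, 17, 18, 19, 20]
```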
While Figure 5 shows a two-source-register example of the instruction, it will be appreciated that other variants could store data from a different number of vector source registers (e.g., a three-register variant could be provided for handling RGB interleaved data, or a four-register variant could be provided for handling RGBA data).
Also, variants can be provided to support different element sizes for the first and second data element size, possibly with different ratios between the second data element size and first data element size (e.g. a 4x narrowing rather than a 2x narrowing). 16-bit to 8-bit narrowing is just one example and other examples could vary the data element size for one or both of the first/second data element size based on a programmable parameter referenced by the instruction (either in the instruction encoding itself, or in a control register, or a combination of both). With particular combinations of number of vector source registers and narrowing, the operation may store to memory more than one vector length's worth of data. Hence, the narrowing vector store instruction may generate more than one vector of narrowed data elements. For example, 4 vector source registers narrowed by 2x would mean that 2 vectors of narrowed data elements are stored to memory.
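The relationship between the number of source registers, the narrowing ratio, and the number of vectors of narrowed data elements can be sketched as follows (assuming, for simplicity of the sketch, that the register count is a multiple of the narrowing ratio):

```python
def output_vectors(num_source_registers, narrowing_ratio):
    """Number of vector-lengths of narrowed data stored to memory:
    each source register contributes VL/narrowing_ratio bits of output."""
    assert num_source_registers % narrowing_ratio == 0, \
        "sketch assumes a whole number of output vectors"
    return num_source_registers // narrowing_ratio

print(output_vectors(2, 2))  # 1: two registers, 2x narrowing -> one vector
print(output_vectors(4, 2))  # 2: four registers, 2x narrowing -> two vectors
```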
Using this type of instruction can provide several advantages. It can mean that fewer micro-operations can be issued to the processing circuitry. In particular, since there is no architectural register to serve as a destination of the narrowing operation, a micro-operation for storing narrowed elements to memory can directly consume the result of the narrowing operation, and there is no requirement to provide a micro-operation for retrieving data from a register (e.g., the "STDATA" micro-operation does not need to be included). Also, an architectural register does not need to be specified to store the result of the narrowing operation. Also, by storing a full vector length of data to memory, memory throughput can be increased compared to examples in which less than a full vector length of data is stored to memory.
Figure 6 shows a non-interleaving variant of the narrowing store instruction to which predication has been applied. Two predicate values Pg are illustrated which may be provided in a register specified by the narrowing vector store instruction (or, at least in the case of the counter, could be specified in the encoding of the narrowing vector store instruction). The predicate values control predication at the granularity of the narrowed data elements in memory. As shown in Figure 6, if an element of the vector of narrowed data elements to be stored to memory is indicated as masked by the predicate (e.g., if the corresponding element of the predicate has the value "0" in Figure 6, or is a value at an element position having an index higher than the predicate counter value), then the store operation may not cause that element to be stored to memory. For example, in Figure 6 elements 0-8 of the vector of narrowed data elements are unmasked and elements 9-15 are masked, and therefore only values of elements 0-8 are stored to memory.
It will be seen that the predicate operates at the granularity of the data elements in memory. For example, a predicate mask may have elements provided at a one-to-one relationship with elements in the vector of narrowed data elements, or each increment of a predicate counter may correspond to a narrowed data element and the predicate counter may have a maximum value which corresponds to the number of narrowed data elements. If the store instruction were non-narrowing such that the source vector register had the same number of elements as the vector stored to memory, then this would mean that the predicate register may only be usable for predication of one source vector register. However, since the narrowing vector store instruction reduces the size of elements of the vector source registers, meaning that there are fewer data elements in each of the source vector registers than in the vector of narrowed data elements, then the predicate value can be used for predication of two or more vector source registers. Figure 6 shows in particular how the predicate mask Pg can be used for predication of elements from both vector source registers Z1 and Z2.
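The reuse of a single predicate value across two source registers can be modelled as follows (a behavioural sketch with one predicate bit per narrowed data element, matching the Figure 6 example where elements 0-8 are unmasked and elements 9-15 are masked):

```python
def predicated_store(narrowed_vector, predicate_bits):
    """Store only the unmasked narrowed elements; masked element
    positions are skipped. Returns the (position, value) pairs that
    would actually be written to memory."""
    return [(i, v)
            for i, (v, p) in enumerate(zip(narrowed_vector, predicate_bits))
            if p]

# 16 narrowed elements drawn from two 8-element source registers;
# one predicate bit per narrowed element covers both source registers
narrowed = list(range(16))
pg = [1] * 9 + [0] * 7  # elements 0-8 unmasked, 9-15 masked
print(predicated_store(narrowed, pg))  # only positions 0-8 are written
```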
Although Figure 6 illustrates predication for the non-interleaving variant, it will be appreciated that predication can also be applied to the interleaving variant of the narrowing vector store instruction. In the interleaving variant, however, the predicate value may control predication at the granularity of element groups rather than the granularity of individual narrowed data elements. Each element group comprises a contiguous block of narrowed data elements in memory, comprising a single data element from each of the channels. If a given element group is indicated as masked by the predicate, then the narrowed data elements corresponding to that group are not stored to memory as part of the store operation. For example, in Figure 5 an element group may comprise a pair of narrowed elements (one from each source register), and a single bit of a predicate mask could control whether both or neither of those elements are stored to memory in response to the narrowing vector store instruction.
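Element-group predication for the interleaving variant can be sketched as follows (illustrative only; one predicate bit controls a whole contiguous group of narrowed elements, one element per channel):

```python
def group_predicated_store(interleaved_vector, group_predicate, group_size=2):
    """Predication at element-group granularity: each predicate bit
    keeps or masks an entire contiguous group of narrowed elements."""
    written = []
    for g, keep in enumerate(group_predicate):
        if keep:
            start = g * group_size
            written.extend(interleaved_vector[start:start + group_size])
    return written

# Interleaved A/B channel pairs; masking group 1 drops both A2' and B2'
interleaved = ["A1'", "B1'", "A2'", "B2'", "A3'", "B3'"]
print(group_predicated_store(interleaved, [1, 0, 1]))  # ["A1'", "B1'", "A3'", "B3'"]
```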
Figure 7 shows a method of performing data processing. At step 100, instruction decoding circuitry 10, 52 decodes an instruction that is encoded according to a given instruction set architecture. At step 102, in response to decoding of the instruction by the instruction decoding circuitry 10, 52, issue circuitry 12, 54 issues at least one micro-operation corresponding to the instruction. At step 104, processing circuitry 16, 56 performs a corresponding processing operation in response to the at least one micro-operation.
Figure 8 illustrates steps performed for processing the operation at step 104 when the instruction decoded at step 100 is a narrowing vector store instruction. At step 110, the processing circuitry 16, 56 narrows data elements of a plurality of vector source registers to a first data element size from a second data element size. At step 112, the processing circuitry 16, 56 stores a vector of narrowed data elements comprising the data elements narrowed at step 110 to a location in memory corresponding to a target memory address computed based on at least one address operand of the narrowing vector store instruction.
Figure 9 illustrates steps performed for the processing operation performed at step 104 when the instruction decoded at step 100 is an interleaving variant of the narrowing vector store instruction. At step 120, the processing circuitry 16, 56 narrows data elements of a plurality of vector source registers to a first data element size from a second data element size. At step 122, the processing circuitry 16, 56 interleaves narrowed data elements from the plurality of vector source registers in a vector of narrowed data elements. The vector of narrowed data elements therefore comprises a plurality of interleaved channels, each channel corresponding to one of the plurality of vector source registers. At step 124, the processing circuitry stores the vector of interleaved narrowed data elements to a location in memory corresponding to a target memory address computed based on at least one address operand of the narrowing vector store instruction. Predication may be applied at step 124 such that a subset of the elements of the vector of narrowed data elements are not stored to memory. The predicate value applied at step 124 may be applied at the granularity of element groups comprising elements at the same position of each of the plurality of vector source registers which are provided in a contiguous portion of the vector of narrowed data elements.
Figure 10 illustrates steps performed for the processing operation performed at step 104 when the instruction decoded at step 100 is a non-interleaving variant of the narrowing vector store instruction. At step 130, the processing circuitry 16, 56 narrows data elements of a plurality of vector source registers to a first data element size from a second data element size. At step 132, the processing circuitry 16, 56 provides at least one vector of narrowed data elements wherein the narrowed data elements corresponding to a given one of the source vector registers are provided in a contiguous portion of the vector of narrowed data elements, without intervening narrowed data elements corresponding to any other of the plurality of source vector registers. Therefore, each of the elements may be taken from a first vector source register, then each of the elements from a second vector source register, and so on. At step 134, the processing circuitry stores the vector of narrowed data elements to a location in memory corresponding to a target memory address computed based on at least one address operand of the narrowing vector store instruction. Predication may be applied at step 134 such that a subset of the elements of the vector of narrowed data elements are not stored to memory. The predicate value used to control predication at step 134 may correspond to more than one of the plurality of vector source registers.
Concepts described herein may be embodied in computer-readable code for fabrication of an apparatus that embodies the described concepts. For example, the computer-readable code can be used at one or more stages of a semiconductor design and fabrication process, including an electronic design automation (EDA) stage, to fabricate an integrated circuit comprising the apparatus embodying the concepts. The above computer-readable code may additionally or alternatively enable the definition, modelling, simulation, verification and/or testing of an apparatus embodying the concepts described herein.
For example, the computer-readable code for fabrication of an apparatus embodying the concepts described herein can be embodied in code defining a hardware description language (HDL) representation of the concepts. For example, the code may define a register-transfer-level (RTL) abstraction of one or more logic circuits for defining an apparatus embodying the concepts. The code may define a HDL representation of the one or more logic circuits embodying the apparatus in Verilog, SystemVerilog, Chisel, or VHDL (Very High-Speed Integrated Circuit Hardware Description Language) as well as intermediate representations such as FIRRTL. Computer-readable code may provide definitions embodying the concept using system-level modelling languages such as SystemC and SystemVerilog or other behavioural representations of the concepts that can be interpreted by a computer to enable simulation, functional and/or formal verification, and testing of the concepts.
Additionally or alternatively, the computer-readable code may define a low-level description of integrated circuit components that embody concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. The one or more netlists or other computer-readable representation of integrated circuit components may be generated by applying one or more logic synthesis processes to an RTL representation to generate definitions for use in fabrication of an apparatus embodying the invention. Alternatively or additionally, the one or more logic synthesis processes can generate from the computer-readable code a bitstream to be loaded into a field programmable gate array (FPGA) to configure the FPGA to embody the described concepts. The FPGA may be deployed for the purposes of verification and test of the concepts prior to fabrication in an integrated circuit or the FPGA may be deployed in a product directly.
The computer-readable code may comprise a mix of code representations for fabrication of an apparatus, for example including a mix of one or more of an RTL representation, a netlist representation, or another computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus embodying the invention. Alternatively or additionally, the concept may be defined in a combination of a computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus and computer-readable code defining instructions which are to be executed by the defined apparatus once fabricated.
Such computer-readable code can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc. An integrated circuit fabricated using the computer-readable code may comprise components such as one or more of a central processing unit, graphics processing unit, neural processing unit, digital signal processor or other components that individually or collectively embody the concept.
Figure 11 illustrates a simulator implementation that may be used. Whilst the earlier described embodiments implement the present invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide an instruction execution environment in accordance with the embodiments described herein which is implemented through the use of a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software-based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation may run on a host processor 330 having host storage circuitry 332 (e.g. registers and/or memory), optionally running a host operating system 320, supporting the simulator program 310. In some arrangements, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and/or multiple distinct instruction execution environments provided on the same host processor. Historically, powerful processors have been required to provide simulator implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a desire to run code native to another processor for compatibility or re-use reasons. For example, the simulator implementation may provide an instruction execution environment with additional functionality which is not supported by the host processor hardware, or provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is given in "Some Efficient Architecture Simulation Techniques", Robert Bedichek, Winter 1990 USENIX Conference, pages 53-63.
To the extent that embodiments have previously been described with reference to particular hardware constructs or features, in a simulated embodiment, equivalent functionality may be provided by suitable software constructs or features. For example, particular circuitry may be implemented in a simulated embodiment as computer program logic. Similarly, memory hardware, such as a register or cache, may be implemented in a simulated embodiment as a software data structure stored in the host storage (e.g. memory or registers) of the host processor 330. In arrangements where one or more of the hardware elements referenced in the previously described embodiments are present on the host hardware (for example, host processor 330), some simulated embodiments may make use of the host hardware, where suitable.
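As a purely illustrative, non-limiting sketch of how memory hardware such as a register file might be implemented as a software data structure in host storage (the class and method names below are invented for this sketch and do not form part of the described embodiments):

```python
# Illustrative sketch only: a target-architecture register file modelled
# as a software data structure held in host memory. Names are hypothetical.
class SimulatedRegisterFile:
    def __init__(self, num_regs: int, reg_bytes: int):
        # Each simulated register is simply a byte array in host memory.
        self.regs = [bytearray(reg_bytes) for _ in range(num_regs)]

    def read(self, n: int) -> bytes:
        # Return an immutable snapshot of the simulated register contents.
        return bytes(self.regs[n])

    def write(self, n: int, value: bytes) -> None:
        # Zero-extend short writes to the full simulated register width.
        self.regs[n][:] = value.ljust(len(self.regs[n]), b"\x00")
```

In such a sketch, accesses that a hardware embodiment would direct to register storage circuitry are instead serviced by ordinary reads and writes of the host data structure.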
The simulator program 310 may be stored on a computer-readable storage medium (which may be a non-transitory medium), and provides a program interface (instruction execution environment) to the target code 300 (which may include applications, operating systems and a hypervisor) which is the same as the interface of the hardware architecture being modelled by the simulator program 310. Thus, the program instructions of the target code 300 may be executed from within the instruction execution environment using the simulator program 310, so that a host computer 330 which does not actually have the hardware features of the apparatus 2 discussed above (e.g. an instruction decoder 10 and processing circuitry 16 supporting the narrowing vector store instruction as discussed above) can emulate these features.
Hence, the simulator program 310 may have instruction decoding program logic 312 for decoding instructions of the target code 300 and mapping these to corresponding sets of instructions in the native instruction set of the host apparatus 330 which are provided as part of processing program logic 313 of the simulator program. The instruction decoding program logic 312 includes decoding program logic 313 for decoding the narrowing vector store instruction as described above. Register emulating program logic 314 maps register accesses requested by instructions of the target code to accesses to corresponding data structures maintained in the host storage circuitry 332 of the host apparatus 330, such as by accessing data in registers or memory of the host apparatus 330. Memory management program logic 316 implements address translation, page table walks and access permission checking to simulate access to a simulated address space by the target code 300, in a corresponding way to the MMU 36 as described in the hardware-implemented embodiment above. Memory address space simulating program logic 318 is provided to map the simulated physical addresses, obtained by the memory management program logic 316 based on address translation using the page table information maintained by software of the target program code 300, to host virtual addresses used to access host memory of the host processor 330. These host virtual addresses may themselves be translated into host physical addresses using the standard address translation mechanisms supported by the host (the translation of host virtual addresses to host physical addresses being outside the scope of what is controlled by the simulator program 310).
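By way of a hedged illustration of how processing program logic might emulate the behaviour of the narrowing vector store instruction described above (the function and parameter names are invented for this sketch and are not part of the described embodiments; element narrowing is shown here as truncation to the lowest-order bits, as in one of the embodiments described above):

```python
# Illustrative sketch only: emulating a narrowing vector store.
# 'sources' models the plurality of vector source registers, each a list
# of integer data elements of src_bytes (the second data element size);
# each element is narrowed to dst_bytes (the first data element size)
# and the narrowed elements are stored to simulated memory at 'addr'.
def narrowing_vector_store(memory: bytearray, addr: int,
                           sources: list[list[int]],
                           src_bytes: int, dst_bytes: int,
                           interleave: bool = False) -> None:
    assert dst_bytes < src_bytes  # first size smaller than second size
    if interleave:
        # Interleaving variant: element i of each source register in turn,
        # giving one interleaved channel per source register.
        elems = [e for group in zip(*sources) for e in group]
    else:
        # Non-interleaving variant: all elements of one register form a
        # contiguous portion, followed by the next register's elements.
        elems = [e for src in sources for e in src]
    for i, e in enumerate(elems):
        # Narrow by keeping only the lowest-order bits of the element.
        narrowed = e & ((1 << (8 * dst_bytes)) - 1)
        memory[addr + i * dst_bytes : addr + (i + 1) * dst_bytes] = \
            narrowed.to_bytes(dst_bytes, "little")
```

For example, narrowing two source registers of 16-bit elements to 8-bit elements stores half as many bytes as the combined source data, either register-by-register or channel-interleaved depending on the variant being emulated.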
In the present application, the words "configured to..." are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a "configuration" means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. "Configured to" does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
In the present application, lists of features preceded with the phrase "at least one of" mean that any one or more of those features can be provided either individually or in combination. For example, "at least one of: [A], [B] and [C]" encompasses any of the following options: A alone (without B or C), B alone (without A or C), C alone (without A or B), A and B in combination (without C), A and C in combination (without B), B and C in combination (without A), or A, B and C in combination.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.

Claims (21)

  1. An apparatus, comprising: instruction decoding circuitry to decode instructions; and issue circuitry to issue, in response to decoding of a given instruction by the instruction decoding circuitry, at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, the instruction decoding circuitry is configured to control the issue circuitry to issue at least one micro-operation to control the processing circuitry to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
  2. The apparatus according to claim 1, in which the at least one micro-operation comprises a first micro-operation and a second micro-operation; wherein the first micro-operation is to control the processing circuitry to form the at least one vector of narrowed data elements, and the second micro-operation is to control the processing circuitry to consume the at least one vector of narrowed data elements formed by the first micro-operation and store the consumed at least one vector of narrowed data elements to the location in the memory system.
  3. The apparatus according to claim 2, in which the second micro-operation is to control the processing circuitry to determine the location in the memory system based on the at least one address operand.
  4. The apparatus according to any preceding claim, in which the encoding of the narrowing vector store instruction permits the size of the at least one vector of narrowed elements to be equal to the given vector length.
  5. The apparatus according to any preceding claim, in which in response to a non-interleaving variant of the narrowing vector store instruction, the instruction decoding circuitry is configured to control the issue circuitry to issue the at least one micro-operation to control the processing circuitry to store to the location in memory the at least one vector of narrowed data elements wherein the narrowed data elements corresponding to a given one of the plurality of source vector registers are provided in a contiguous portion of the at least one vector of narrowed data elements without intervening narrowed data elements corresponding to any other of the plurality of source vector registers.
  6. The apparatus according to claim 5, in which the non-interleaving variant of the narrowing vector store instruction specifies at least one predicate register, each predicate register specifying element predication information indicative of which data elements of the plurality of vector source registers are masked data elements for which corresponding portions of the at least one vector of narrowed data elements are to specify a value independent of the masked data elements; wherein a given predicate register specifies element predication information corresponding to two or more of the plurality of vector source registers.
  7. The apparatus according to any preceding claim, in which in response to the narrowing vector store instruction specifying N vector source registers, the instruction decoding circuitry is configured to control the issue circuitry to issue the at least one micro-operation to control the processing circuitry to narrow the data elements to the first data element size being no larger than 1/N times the second data element size.
  8. The apparatus according to any preceding claim, in which, in response to an interleaving variant of the narrowing vector store instruction, the instruction decoding circuitry is configured to control the issue circuitry to issue the at least one micro-operation to control the processing circuitry to store to the memory system at least one interleaved vector of narrowed data elements comprising a plurality of interleaved channels of narrowed data elements, each of the interleaved channels corresponding to one of the source vector registers.
  9. The apparatus according to claim 8, in which an interleaving pattern for interleaving the data elements of the plurality of source vector registers in the at least one vector of narrowed data elements is implicitly defined by an encoding of the interleaving variant of the narrowing store instruction.
  10. The apparatus according to any of claims 8 and 9, in which the plurality of vector source registers comprises at least three vector source registers.
  11. The apparatus according to any preceding claim, in which the processing circuitry is configured to narrow a given data element by selecting as the narrowed data element a portion of lowest order bits of the given data element, the portion having a size equal to the first data element size.
  12. The apparatus according to any preceding claim, in which the first data element size and the second data element size are variable depending on a control parameter associated with the narrowing vector store instruction.
  13. The apparatus according to any preceding claim, in which the first size is not larger than half of the second size.
  14. The apparatus according to any preceding claim, comprising vector length storage circuitry to store a vector length parameter indicative of the given vector length.
  15. The apparatus according to any preceding claim, comprising the processing circuitry.
  16. The apparatus according to any preceding claim, in which, for at least a subset of types of instruction, the issue circuitry is configured to issue the at least one micro-operation to off-chip processing circuitry on a separate integrated circuit to the issuing circuitry.
  17. The apparatus according to any of claims 1 to 15, comprising a co-processor configured to perform processing operations for a subset of instruction types offloaded by a main processor, the co-processor comprising the instruction decoding circuitry for decoding instructions of said subset of instruction types and the issue circuitry.
  18. Computer-readable code for fabrication of an apparatus comprising: instruction decoding circuitry to decode instructions; and issue circuitry to issue, in response to decoding of a given instruction by the instruction decoding circuitry, at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, the instruction decoding circuitry is configured to control the issue circuitry to issue at least one micro-operation to control the processing circuitry to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
  19. A method comprising: decoding instructions; and in response to decoding of a given instruction, issuing at least one micro-operation to control processing circuitry to perform a processing operation corresponding to the given instruction; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, at least one micro-operation is issued to control the processing circuitry to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a memory system, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the memory system corresponding to a target memory address determined based on the at least one address operand.
  20. A computer program for controlling a host data processing apparatus to provide an instruction execution environment for execution of target program code, the computer program comprising: instruction decoding program logic to decode instructions of the target program code; and processing program logic to perform a processing operation corresponding to a given instruction decoded by the instruction decoding program logic; in which: in response to decoding of a narrowing vector store instruction specifying at least one address operand and a plurality of vector source registers each for specifying a vector operand having a given vector length, the instruction decoding program logic is configured to control the issue program logic to issue at least one micro-operation to control the processing program logic to: narrow data elements of the plurality of vector source registers to a first data element size from a second data element size, the first data element size being smaller than the second data element size; and store, to a location in a simulated address space, at least one vector of narrowed data elements comprising data elements of the plurality of vector source registers narrowed to the first data element size, the location in the simulated address space corresponding to a target memory address determined based on the at least one address operand.
  21. A storage medium storing the computer-readable code of claim 18 or the computer program of claim 20.
GB2314829.9A 2023-09-27 2023-09-27 Narrowing vector store instruction Active GB2634041B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2314829.9A GB2634041B (en) 2023-09-27 2023-09-27 Narrowing vector store instruction
PCT/GB2024/052127 WO2025068671A1 (en) 2023-09-27 2024-08-13 Narrowing vector store instruction
TW113134370A TW202514357A (en) 2023-09-27 2024-09-11 Narrowing vector store instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2314829.9A GB2634041B (en) 2023-09-27 2023-09-27 Narrowing vector store instruction

Publications (3)

Publication Number Publication Date
GB202314829D0 GB202314829D0 (en) 2023-11-08
GB2634041A true GB2634041A (en) 2025-04-02
GB2634041B GB2634041B (en) 2026-02-04

Family

ID=88599074

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2314829.9A Active GB2634041B (en) 2023-09-27 2023-09-27 Narrowing vector store instruction

Country Status (3)

Country Link
GB (1) GB2634041B (en)
TW (1) TW202514357A (en)
WO (1) WO2025068671A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120254592A1 (en) * 2011-04-01 2012-10-04 Jesus Corbal San Adrian Systems, apparatuses, and methods for expanding a memory source into a destination register and compressing a source register into a destination memory location
US20190026109A1 (en) * 2015-12-22 2019-01-24 Intel IP Corporation Instructions and logic for vector bit field compression and expansion
US20190220278A1 (en) * 2019-03-27 2019-07-18 Menachem Adelman Apparatus and method for down-converting and interleaving multiple floating point values
US20190339971A1 (en) * 2016-07-08 2019-11-07 Arm Limited An apparatus and method for performing a rearrangement operation
US20210026628A1 (en) * 2017-07-20 2021-01-28 Arm Limited Register-based complex number processing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970078B2 (en) * 2018-04-05 2021-04-06 Apple Inc. Computation engine with upsize/interleave and downsize/deinterleave options
US11093579B2 (en) * 2018-09-05 2021-08-17 Intel Corporation FP16-S7E8 mixed precision for deep learning and other algorithms
US12135968B2 (en) * 2020-12-26 2024-11-05 Intel Corporation Instructions to convert from FP16 to BF8

Also Published As

Publication number Publication date
WO2025068671A1 (en) 2025-04-03
GB2634041B (en) 2026-02-04
TW202514357A (en) 2025-04-01
GB202314829D0 (en) 2023-11-08
