HK1099095B - An apparatus and a system for performing operations of multimedia applications and a method for performing the same

Info

Publication number: HK1099095B
Application number: HK07105278.5A
Authority: HK
Inventors: D. Peleg Alexander; Yaari Yaacov; Mittal Millind; M. Mennemeier Larry; Eitan Benny; F. Glew Andrew; Dulong Carole; Kowashi Eiichi; Witt Wolf
Original assignee: Intel Corporation
Priority date: 1995-08-31
Filing date: 2007-05-18
Publication date: 2011-06-03

Description

Device and system for executing operation of multimedia application and method for realizing operation

The present application is a divisional application of the patent application filed on July 17, 1996, with application number 03132844.X, entitled "method and apparatus capable of performing a fast switching operation using packet data instructions".

Technical Field

The invention relates to the field of computer systems. More particularly, the invention relates to the field of packed data operations.

Background

In a typical computer system, a processor is implemented to operate on values represented by a large number of bits (e.g., 64) using instructions that produce a result. For example, executing the add instruction adds the first 64-bit value to the second 64-bit value and stores the result as a third 64-bit value. However, multimedia applications, such as applications aimed at computer-supported collaboration (CSC-integration of teleconferencing with mixed media data processing), 2D/3D graphics, image processing, video compression/decompression, recognition algorithms and audio processing, require the processing of large amounts of data that can be represented with a small number of bits. For example, graphics data typically requires 8 or 16 bits, and sound data typically requires 8 or 16 bits. Each of these multimedia applications requires one or more algorithms, each requiring several operations. For example, an algorithm may require addition, comparison, and shift operations.

To improve the efficiency of multimedia applications (and other applications with the same characteristics), prior art processors provide packed data formats. In a packed data format, the bits that are normally used to represent a single value are divided into several fixed-length data elements, each of which represents a separate value. For example, a 64-bit register may be divided into two 32-bit elements, each representing a separate 32-bit value. In addition, these prior art processors provide instructions that process the elements of these packed data types separately and in parallel. For example, a packed add instruction adds corresponding data elements from a first packed data and a second packed data. Thus, if a multimedia algorithm requires a loop containing five operations that must be performed on a large number of data elements, it is desirable to pack the data and perform these operations in parallel using packed data instructions. In this manner, the processors can process multimedia applications more efficiently.

However, if the loop of operations contains an operation that the processor cannot perform on packed data (i.e., the processor lacks the appropriate instruction), the data must be unpacked to perform that operation. For example, if the multimedia algorithm requires an addition and the packed add instruction described above is not available, the programmer must unpack the first packed data and the second packed data (i.e., separate the elements they contain), add the individual elements, and then pack the results into a packed result for further packed processing. The processing time required to perform such packing and unpacking typically cancels the performance advantage of providing packed data formats. It is therefore desirable to include on a general purpose processor a packed data instruction set that provides all of the operations required by typical multimedia algorithms. However, because of the limited die area on today's microprocessors, the number of instructions that can be added is limited.

One general purpose processor containing packed data instructions is the i860XP™ processor manufactured by Intel Corporation of Santa Clara, California. The i860XP processor contains several packed data types with different element sizes. In addition, the i860XP processor contains packed add and packed compare instructions. However, the packed add instruction does not break the carry chain, so the programmer must ensure that the operations the software performs do not cause an overflow, i.e., that an operation does not cause bits from one element of the packed data to spill into the next element. For example, if the value 1 is added to an 8-bit packed data element storing "11111111", an overflow occurs and the result is "100000000". In addition, the position of the decimal point in the packed data types supported by the i860XP processor is fixed (i.e., the i860XP processor supports the number formats 8.8, 6.10, and 8.24, where a number i.j contains i most significant bits and j bits after the decimal point), which limits the values that the programmer can represent. Since the i860XP processor supports only these two instructions, it cannot perform many of the operations required by multimedia algorithms that employ packed data.
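The carry-chain problem described above can be illustrated with a minimal Python sketch. It is not a model of the i860XP; the function names are hypothetical, and discarding the carry out of each element (wraparound) is an assumption chosen only to show the contrast.

```python
MASK64 = (1 << 64) - 1

def add_unbroken_carry(a, b):
    """Add two 64-bit values as single integers; carries cross element boundaries."""
    return (a + b) & MASK64

def add_broken_carry(a, b, width=8):
    """Add element by element; a carry out of one element is discarded (assumed)."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (total & mask) << shift
    return result

a = 0x00FF  # byte 0 holds 11111111, byte 1 holds 0
b = 0x0001  # add 1 to byte 0 only
print(hex(add_unbroken_carry(a, b)))  # 0x100: the carry spilled into byte 1
print(hex(add_broken_carry(a, b)))    # 0x0: byte 0 wrapped, byte 1 untouched
```

With an unbroken carry chain the software must guarantee that no element overflows; with the chain broken at element boundaries, neighbouring elements cannot corrupt each other.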

Another general purpose processor that supports packed data is the MC88110™ processor manufactured by Motorola. The MC88110 processor supports several different packed data formats with elements of different lengths. Further, the packed instruction set supported by the MC88110 processor includes pack, unpack, packed add, packed subtract, packed multiply, packed compare, and packed rotate.

The MC88110 processor pack command operates by concatenating the (t × r)/64 most significant bits of each element in a first register pair (where t is the number of bits in an element of the packed data) to generate a field of width r. This field replaces the most significant bits of the packed data stored in a second register pair. This packed data is then stored in a third register pair and rotated left by r bits. The supported values of t and r, and an example of the operation of this instruction, are shown below in Tables 1 and 2.

Undefined operation

TABLE 1

TABLE 2

This implementation of the pack instruction has two drawbacks. The first is that additional logic is required to perform the rotation at the end of the instruction. The second is the number of instructions required to generate a packed data result. For example, if it is desired to use four 32-bit values to generate the result in the third register (shown above), two instructions with t=32 and r=32 are required, as shown in Table 3 below.

TABLE 3

The MC88110 processor unpack command operates by placing a 4-, 8-, or 16-bit data element from the packed data into the lower half of a data element that is twice as long (8, 16, or 32 bits) and padding with zeros, i.e., setting the higher order bits of the resulting data element to zero. An example of the operation of this unpack command is shown in Table 4 below.

TABLE 4

The MC88110 processor packed multiply instruction multiplies each element of a 64-bit packed data by a 32-bit value as if the packed data represented a single value, as shown in Table 5 below.

TABLE 5

This multiply instruction has two disadvantages. First, it does not break the carry chain, so the programmer must ensure that operations performed on the packed data do not cause an overflow. As a result, programmers sometimes must add extra instructions to prevent this overflow. Second, it multiplies every element in the packed data by a single value (i.e., the 32-bit value). As a result, the user has no flexibility to select which elements in the packed data are multiplied by the 32-bit value. Thus, the programmer must either prepare the data so that the same multiplication is required on every element in the packed data, or waste processing time unpacking the data whenever fewer than all of the elements need to be multiplied. The programmer cannot perform multiple multiplications with multiple multipliers in parallel. For example, multiplying eight different pieces of data, each one word long, requires four separate multiply operations. Each operation multiplies two words at a time, effectively wasting the data lines and circuitry used for bits above bit 16.

The MC88110 processor packed compare instruction compares corresponding 32-bit data elements from a first packed data and a second packed data. Each of the two comparisons may return either less than (<) or greater than or equal to (≥), giving four possible combinations. The instruction returns an 8-bit result string; four bits indicate which of the four possible conditions was met, and four bits indicate the complements of those bits. Conditional branching on the result of this instruction can be implemented in two ways: 1) using a sequence of conditional branches; or 2) using a jump table. The problem with this instruction is that it requires conditional branches based on data to perform functions such as: if Y > A then X = X + B, else X = X. The pseudo-code representation of this function would be:

New microprocessors attempt to speed up execution by speculatively predicting where a branch will go. If the prediction is correct, no performance is lost and there is a potential to gain performance. However, if the prediction is wrong, performance is lost. Therefore, the incentive to predict well is great. However, branches based on data (such as the one above) behave in an unpredictable manner, which defeats the prediction algorithms and leads to more mispredictions. As a result, using this compare instruction to set up conditional branches based on data comes at a high cost in performance.

The MC88110 processor rotate instruction rotates a 64-bit value to any modulo-4 boundary between 0 and 60 bits (see the examples of table 6 below).

TABLE 6

The MC88110 processor does not support individually shifting the elements of packed data, because the rotate instruction shifts the high-order bits that leave the register back into the low-order bits of the register. As a result, programming algorithms that require individual shifting of the elements in a packed data type must: 1) unpack the data, 2) perform the shift on each element individually, and 3) pack the results into a packed result for further packed data processing.
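A short Python sketch shows why a whole-register rotate cannot stand in for a per-element shift. The function name is hypothetical and the rotate direction (left) is an assumption; only the restriction to multiples of 4 between 0 and 60 is taken from the description above.

```python
MASK64 = (1 << 64) - 1

def rotate_left_64(value, count):
    """Rotate a 64-bit value left by `count` bits (a multiple of 4, from 0 to 60)."""
    if count % 4 != 0 or not 0 <= count <= 60:
        raise ValueError("count must be a multiple of 4 between 0 and 60")
    # Bits shifted out of the top re-enter at the bottom of the register.
    return ((value << count) | (value >> (64 - count))) & MASK64

# The top bit of the highest element wraps around into the lowest element,
# so the elements are not shifted independently of one another.
print(hex(rotate_left_64(0x8000000000000001, 4)))  # 0x18
```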

Disclosure of Invention

Methods and apparatus for incorporating into a processor a packed data instruction set that supports the operations required by typical multimedia applications are described. In one embodiment, the present invention includes a processor and a storage area. The storage area contains instructions for execution by the processor to manipulate packed data. In this embodiment, the instructions include pack, unpack, packed add, packed subtract, packed multiply, packed shift, and packed compare.

In response to receiving the pack instruction, the processor packs a portion of the bits from the data elements in at least two packed data to form a third packed data. In response to receiving the unpack instruction, the processor generates a fourth packed data comprising at least one data element from a first packed data operand and at least one corresponding data element from a second packed data operand.
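The following Python sketch gives one possible reading of these two operations. Which bits a pack keeps and how an unpack orders its result are not specified in this passage, so both choices below are assumptions: the hypothetical `pack_words_to_bytes` keeps the low-order byte of each 16-bit element, and the hypothetical `unpack_low_bytes` interleaves the four low-order bytes of its two operands.

```python
def pack_words_to_bytes(src1, src2):
    """Keep the low byte of each 16-bit element of src1, then src2 (assumed rule)."""
    result = 0
    position = 0
    for source in (src1, src2):
        for shift in range(0, 64, 16):
            low_byte = (source >> shift) & 0xFF
            result |= low_byte << position
            position += 8
    return result

def unpack_low_bytes(src1, src2):
    """Interleave the four low-order bytes of src1 and src2 (assumed ordering)."""
    result = 0
    for index in range(4):
        a = (src1 >> (8 * index)) & 0xFF
        b = (src2 >> (8 * index)) & 0xFF
        result |= a << (16 * index)      # element from the first operand
        result |= b << (16 * index + 8)  # corresponding element from the second
    return result

print(hex(pack_words_to_bytes(0x1144113311221111, 0x2288227722662255)))
print(hex(unpack_low_bytes(0x04030201, 0xD4C3B2A1)))
```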

In response to receiving the packed add instruction, the processor separately adds together, in parallel, corresponding data elements from at least two packed data. Likewise, in response to receiving the packed subtract instruction, the processor separately subtracts, in parallel, corresponding data elements from at least two packed data.
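As a minimal sketch of element-wise addition and subtraction, the Python below treats a 64-bit integer as four 16-bit elements. The helper names are hypothetical, the loop only models what hardware would do in parallel, and wraparound on overflow or underflow is an assumption.

```python
import operator

def packed_arith(src1, src2, width, op):
    """Apply `op` to each pair of corresponding elements, wrapping within `width` bits."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        x = (src1 >> shift) & mask
        y = (src2 >> shift) & mask
        result |= (op(x, y) & mask) << shift
    return result

def packed_add(src1, src2, width=16):
    return packed_arith(src1, src2, width, operator.add)

def packed_sub(src1, src2, width=16):
    return packed_arith(src1, src2, width, operator.sub)

src1 = 0x0004000300020001
src2 = 0x0001000100010002
print(hex(packed_add(src1, src2)))  # 0x5000400030003
print(hex(packed_sub(src1, src2)))  # 0x300020001ffff
```

In the subtraction, the lowest element computes 1 - 2 and wraps to FFFF without borrowing from the element above it.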

In response to receiving the packed multiply instruction, the processor separately multiplies, in parallel, corresponding data elements from at least two packed data.

In response to receiving the packed shift instruction, the processor individually shifts, in parallel, each data element in a packed data operand by the indicated count.
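A per-element shift can be sketched in Python as follows. The function names are hypothetical, and the choice of logical (zero-filling) shifts is an assumption; the point is only that bits shifted out of one element never enter a neighbouring element.

```python
def packed_shift_left(src, count, width):
    """Shift each `width`-bit element left by `count`, discarding bits that leave it."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        element = (src >> shift) & mask
        result |= ((element << count) & mask) << shift
    return result

def packed_shift_right_logical(src, count, width):
    """Shift each `width`-bit element right by `count`, filling with zeros."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        element = (src >> shift) & mask
        result |= (element >> count) << shift
    return result

src = 0x8001000200F0000F
print(hex(packed_shift_left(src, 4, 16)))           # 0x1000200f0000f0
print(hex(packed_shift_right_logical(src, 4, 16)))  # 0x8000000000f0000
```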

In response to receiving the packed compare instruction, the processor individually compares, in parallel, corresponding data elements from at least two packed data according to an indicated relationship and, as a result, stores a packed mask in a first register. The packed mask includes at least a first mask element and a second mask element. Each bit in the first mask element represents the result of comparing one set of corresponding data elements, and each bit in the second mask element represents the result of comparing a second set of data elements.
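One way such a mask can replace a data-dependent branch is sketched below in Python, using the function "if Y > A then X = X + B, else X = X" from the background discussion. The function names are hypothetical, the elements are treated as unsigned, and the mask elements are assumed to be all ones or all zeros; none of this is claimed to be the circuit of the embodiment.

```python
def packed_compare_gt(src1, src2, width):
    """Return a packed mask: all ones where src1 element > src2 element, else zeros."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        if ((src1 >> shift) & mask) > ((src2 >> shift) & mask):
            result |= mask << shift
    return result

def packed_add(src1, src2, width):
    """Element-wise add with wraparound inside each element (assumed behaviour)."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        total = ((src1 >> shift) & mask) + ((src2 >> shift) & mask)
        result |= (total & mask) << shift
    return result

def conditional_add(x, y, a, b, width=32):
    """Per element: X + B where Y > A, otherwise X, without any branch on the data."""
    selected_b = b & packed_compare_gt(y, a, width)
    return packed_add(x, selected_b, width)

y = 0x0000000500000001  # elements (high, low) = (5, 1)
a = 0x0000000300000003  # elements (3, 3)
x = 0x0000000A0000000A  # elements (10, 10)
b = 0x0000000100000001  # elements (1, 1)
print(hex(conditional_add(x, y, a, b)))  # 0xb0000000a
```

Only the high element satisfies Y > A, so only that element of X is incremented.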

Drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Like reference numerals refer to like elements.

FIG. 1 illustrates an exemplary computer system according to one embodiment of the invention.

FIG. 2 illustrates a register file of a processor according to one embodiment of the invention.

FIG. 3 is a flow diagram illustrating the general steps by which a processor processes data in accordance with one embodiment of the invention.

FIG. 4 illustrates packed data types according to one embodiment of the invention.

FIG. 5a shows packed data in a register according to one embodiment of the invention.

FIG. 5b shows packed data in a register according to one embodiment of the invention.

FIG. 5c shows packed data in a register according to one embodiment of the invention.

FIG. 6a shows a control signal format indicating the use of packed data according to one embodiment of the invention.

FIG. 6b illustrates a second control signal format indicating the use of packed data according to one embodiment of the invention.

Packed addition/subtraction

FIG. 7a illustrates a method of performing packed addition according to one embodiment of the invention.

FIG. 7b illustrates a method of performing packed subtraction according to one embodiment of the invention.

FIG. 8 illustrates a circuit for performing packed addition and packed subtraction on individual bits of packed data according to one embodiment of the invention.

FIG. 9 illustrates a circuit for performing packed addition and packed subtraction on packed byte data according to one embodiment of the invention.

FIG. 10 is a logical view of a circuit that performs packed addition and packed subtraction on packed word data according to one embodiment of the invention.

FIG. 11 is a logical view of a circuit that performs packed addition and packed subtraction on packed doubleword data according to one embodiment of the invention.

Packed multiplication

FIG. 12 is a flow diagram illustrating a method of performing a packed multiply operation on packed data according to one embodiment of the invention.

FIG. 13 illustrates a circuit for performing packed multiplication according to one embodiment of the invention.

Multiply-add/subtract

FIG. 14 is a flow diagram illustrating a method of performing multiply-add and multiply-subtract operations on packed data according to one embodiment of the invention.

FIG. 15 illustrates a circuit for performing multiply-add and/or multiply-subtract operations on packed data according to one embodiment of the invention.

Packed shift

FIG. 16 is a flow diagram illustrating a method of performing a packed shift operation on packed data according to one embodiment of the invention.

FIG. 17 illustrates a circuit for performing a packed shift on individual bytes of packed data according to one embodiment of the invention.

Pack

FIG. 18 is a flow diagram illustrating a method of performing a pack operation on packed data according to one embodiment of the invention.

FIG. 19a illustrates a circuit for performing a pack operation on packed byte data according to one embodiment of the invention.

FIG. 19b illustrates a circuit for performing a pack operation on packed word data according to one embodiment of the invention.

Unpack

FIG. 20 is a flow diagram illustrating a method of performing an unpack operation on packed data according to one embodiment of the invention.

FIG. 21 illustrates a circuit for performing an unpack operation on packed data according to one embodiment of the invention.

Population count

FIG. 22 is a flow diagram illustrating a method of performing a population count operation on packed data according to one embodiment of the invention.

FIG. 23 is a flow diagram illustrating a method of performing a population count operation on one data element of a packed data and generating a single result data element for the resulting packed data according to one embodiment of the invention.

FIG. 24 illustrates a circuit for performing a population count operation on packed data having four word data elements according to one embodiment of the invention.

FIG. 25 illustrates a detailed circuit for performing a population count operation on one word data element of a packed data according to one embodiment of the invention.

Packed logical operations

FIG. 26 is a flow diagram illustrating a method of performing logical operations on packed data according to one embodiment of the invention.

FIG. 27 illustrates a circuit for performing logical operations on packed data according to one embodiment of the invention.

Packed compare

FIG. 28 is a flow diagram illustrating a method of performing a packed compare operation on packed data according to one embodiment of the invention.

FIG. 29 illustrates a circuit for performing a packed compare operation on individual bytes of packed data according to one embodiment of the invention.

Detailed Description

This application describes methods and apparatus that include in a processor a set of instructions that support operations on packet data required by typical multimedia applications. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it is understood that the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to unnecessarily obscure the present invention.

Definitions

In order to provide a basis for understanding the description of the embodiments of the present invention, the following definitions are provided.

Bit X through bit Y: defines a subfield of a binary number. For example, bit 6 through bit 0 of the byte 00111010₂ (shown in base 2) represent the subfield 111010₂. The '2' following a binary number indicates base 2. Thus, 1000₂ equals 8₁₀, and F₁₆ equals 15₁₀.
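The bit-range notation can be checked with a few lines of Python. The helper name `bit_field` is hypothetical and is used here only to restate the definition.

```python
def bit_field(value, high, low):
    """Return bit `high` through bit `low` of `value` as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)

# Bit 6 through bit 0 of the byte 00111010 (base 2) is the subfield 111010 (base 2).
print(bin(bit_field(0b00111010, 6, 0)))  # 0b111010

# Base conversions used in the definition.
print(int("1000", 2), int("F", 16))  # 8 15
```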

R_X: is a register. A register is any device capable of storing and providing data. Further functionality of a register is described below. A register is not necessarily part of the processor's package.

SRC1, SRC2, and DEST: identify storage areas (such as memory addresses, registers, etc.).

Source1-i and Result1-i: represent data.

Computer system

FIG. 1 illustrates an exemplary computer system 100 according to one embodiment of the invention. Computer system 100 includes a bus 101, or other communication hardware and software, for communicating information, and a processor 109 coupled with bus 101 for processing information. Processor 109 represents a central processing unit of any type of architecture, including a CISC (complex instruction set computing) or RISC (reduced instruction set computing) type architecture. Computer system 100 also includes a random access memory (RAM) or other dynamic storage device (referred to as main memory 104) coupled to bus 101 for storing information and instructions to be executed by processor 109. Main memory 104 also may be used for storing temporary variables or other intermediate information during execution of instructions by processor 109. Computer system 100 also includes a read only memory (ROM) 106 and/or other static storage device coupled to bus 101 for storing static information and instructions for processor 109. A data storage device 107 is coupled to bus 101 for storing information and instructions.

FIG. 1 also shows that processor 109 includes execution unit 130, register file 150, cache 160, decoder 165, and internal bus 170. Of course, the processor 109 also contains other circuitry, which is not shown in order not to obscure the invention.

Execution unit 130 is used to execute instructions received by processor 109. In addition to recognizing the instructions typically implemented in general purpose processors, execution unit 130 recognizes the instructions in packed instruction set 140, which perform operations on packed data formats. In one embodiment, packed instruction set 140 includes instructions that support pack operations, unpack operations, packed add operations, packed subtract operations, packed multiply operations, packed shift operations, packed compare operations, multiply-add operations, multiply-subtract operations, population count operations, and a set of packed logical operations (including packed AND, packed NAND, packed OR, and packed XOR) in the manner described below. Although one embodiment of packed instruction set 140 containing these instructions is described, other embodiments may contain a subset or a superset of these instructions.

By including these instructions, the operations required by many of the algorithms used in multimedia applications can be performed using packed data. Thus, these algorithms can be written to pack the necessary data and perform the necessary operations on the packed data, without having to unpack the packed data to perform one or more operations on one data element at a time. As mentioned above, this provides a performance advantage over prior art general purpose processors that do not support the packed data operations required by certain multimedia algorithms (i.e., if a multimedia algorithm requires an operation that cannot be performed on packed data, the program must unpack the data, perform the operation on the separate elements individually, and then pack the results into a packed result for further packed processing). Furthermore, the disclosed manner in which several of these instructions are executed improves the performance of many multimedia applications.

Execution unit 130 is coupled to register file 150 by internal bus 170. Register file 150 represents a storage area on processor 109 for storing information, including data. It should be appreciated that one aspect of the invention is the described set of instructions for operating on packed data. According to this aspect of the invention, the storage area used to store the packed data is not critical. However, one embodiment of register file 150 is described later with reference to FIG. 2. Execution unit 130 is coupled to cache 160 and decoder 165. Cache 160 is used to cache data and/or control signals from, for example, main memory 104, and decoder 165 is used to decode instructions received by processor 109 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 130 performs the appropriate operations. For example, if an add instruction is received, decoder 165 causes execution unit 130 to perform the required addition; if a subtract instruction is received, decoder 165 causes execution unit 130 to perform the required subtraction; and so on. Decoder 165 may be implemented using any number of different mechanisms (such as a look-up table, a hardware implementation, a PLA, etc.). Thus, although the execution of the various instructions by the decoder and execution unit is represented by a series of if/then statements, it should be understood that executing an instruction does not require serial processing of these if/then statements. Rather, any mechanism for logically performing this if/then processing is considered to be within the scope of the present invention.

Fig. 1 also shows a data storage device 107, such as a magnetic disk or optical disc, and its corresponding disc drive. Computer system 100 can also be coupled via bus 101 to a display device 121 that displays information to a computer user. The display device 121 may include a frame buffer, a dedicated graphics rendering device (graphics rendering device), a Cathode Ray Tube (CRT), and/or a flat panel display. An alphanumeric input device 122, including alphanumeric and other keys, is typically coupled to bus 101 for communicating information and command selections to processor 109. Another type of user input device is cursor control device 123, such as a mouse, a trackball, pen, touch screen, or cursor direction keys for communicating direction information and command selections to processor 109 and for controlling cursor movement on display device 121. This input device typically has two degrees of freedom in two axes, a first axis (e.g., X) and a second axis (e.g., Y), that allows the device to specify positions in a plane. However, the present invention should not be limited to input devices having only two degrees of freedom.

Another device that may be coupled to bus 101 is a hard copy device 124, which may be used to print instructions, data, or other information on a medium such as paper, film, or similar types of media. Further, computer system 100 may be coupled to a device 125 for sound recording and/or playback, such as an audio digitizer coupled to a microphone for recording information. In addition, the device may include a speaker coupled to a digital-to-analog (D/A) converter for playing back digitized sounds.

Computer system 100 may also be a terminal in a computer network (such as a LAN). Computer system 100 would then be a computer subsystem of the computer network. Computer system 100 optionally includes a video digitizing device 126. Video digitizing device 126 can be used to capture video images that can be transmitted to other computers on the computer network.

In one embodiment, processor 109 additionally supports an instruction set that is compatible with the x86 instruction set (the instruction set used by existing microprocessors manufactured by Intel Corporation of Santa Clara, California). Thus, in one embodiment, processor 109 supports all of the operations supported in the IA™ (Intel Architecture), as defined by Intel Corporation of Santa Clara, California (see "Microprocessors", Intel Data Books volumes 1 and 2, 1992 and 1993, available from Intel of Santa Clara, California). As a result, processor 109 can support existing x86 operations in addition to the operations of the present invention. Although the present invention is described as being incorporated into an x86-based instruction set, alternative embodiments may incorporate the invention into other instruction sets. For example, the present invention may be incorporated into a 64-bit processor that uses a new instruction set.

FIG. 2 illustrates a register file of a processor according to one embodiment of the invention. Register file 150 is used to store information including control/status information, integer data, floating point data, and packet data. In the embodiment shown in FIG. 2, register file 150 includes integer registers 201, registers 209, status registers 208, and instruction pointer register 211. The status register 208 indicates the status of the processor 109. The instruction pointer register 211 stores the address of the next instruction to be executed. Integer registers 201, registers 209, status registers 208, and instruction pointer register 211 are all coupled on internal bus 170. Any additional registers are also coupled to internal bus 170.

In one embodiment, registers 209 are used for both packed data and floating point data. In this embodiment, processor 109 must, at any given time, treat registers 209 either as stack-referenced floating point registers or as non-stack-referenced packed data registers. In this embodiment, a mechanism is included to allow processor 109 to switch between operating on registers 209 as stack-referenced floating point registers and as non-stack-referenced packed data registers. In another embodiment, processor 109 may simultaneously operate on registers 209 as non-stack-referenced floating point and packed data registers. As another example, in another embodiment, these same registers may be used to store integer data.

Of course, alternative embodiments may be implemented to contain more or fewer sets of registers. For example, an alternative embodiment may include a separate set of floating point registers for storing floating point data. As another example, an alternative embodiment may include a first set of registers, each for storing control/status information, and a second set of registers, each capable of storing integer, floating point, and packed data. For clarity, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data and performing the functions described herein.

The various sets of registers (such as integer registers 201 and registers 209) may be implemented to include different numbers of registers and/or registers of different sizes. For example, in one embodiment, integer registers 201 are implemented to store 32 bits, while registers 209 are implemented to store 80 bits (all 80 bits are used to store floating point data, while only 64 bits are used for packed data). In addition, registers 209 contain eight registers, R₀ 212a through R₇ 212h. R₁ 212a, R₂ 212b, and R₃ 212c are examples of individual registers in registers 209. Thirty-two bits of a register in registers 209 can be moved into one of integer registers 201. Similarly, a value in an integer register can be moved into 32 bits of one of registers 209. In another embodiment, integer registers 201 each contain 64 bits, and 64 bits of data may be moved between integer registers 201 and registers 209.

FIG. 3 is a flow diagram illustrating the general steps by which a processor processes data in accordance with one embodiment of the invention. For example, these operations include a load operation that loads data from cache 160, main memory 104, read only memory (ROM) 106, or data storage device 107 into a register in register file 150.

In step 301, decoder 202 receives control signal 207 from cache 160 or bus 101. Decoder 202 decodes the control signal to determine the operation to be performed.

In step 302, decoder 202 accesses register file 150 or a location in memory. Registers in register file 150, or memory locations in memory, are accessed depending on the register address specified in control signal 207. For example, for an operation on packed data, control signal 207 may include SRC1, SRC2, and DEST register addresses. SRC1 is the address of the first source register. SRC2 is the address of the second source register. The SRC2 address is optional in some cases, because not all operations require two source addresses. If an operation does not require the SRC2 address, only the SRC1 address is used. DEST is the address of the destination register in which the result data is stored. In one embodiment, SRC1 or SRC2 is also used as DEST. SRC1, SRC2, and DEST are described more fully with reference to FIGS. 6a and 6b. The data stored in the corresponding registers are referred to as Source1, Source2, and Result, respectively. Each of these data is 64 bits in length.

In another embodiment of the invention, any or all of SRC1, SRC2, and DEST can define a memory cell in the addressable memory space of processor 109. For example, SRC1 may identify a memory location in main memory 104, while SRC2 identifies a first register in integer registers 201 and DEST identifies a second register in registers 209. For simplicity of description herein, the present invention will be described with respect to accessing register file 150. However, these accesses can be made to memory.

In step 303, execution unit 130 is enabled to perform the operation on the accessed data. At step 304, the result is stored back to register file 150 as required by control signal 207.

Data and storage format

Fig. 4 illustrates packet data types according to one embodiment of the invention. Three packet data formats are shown: packet bytes 401, packet words 402, and packet doublewords 403. In one embodiment of the invention, the packet byte is 64 bits long containing 8 data elements. Each data element is one byte long. Typically, a data element is a single piece of data that is stored in a single register (or memory unit) along with other data elements of the same length. In one embodiment of the invention, the number of data elements stored in the register is 64 bits divided by the bit length of one data element.

The packet word 402 is 64 bits long and contains four word 402 data elements. Each word 402 data element contains 16 bits of information.

The packed doubleword 403 is 64 bits long and contains two doubleword 403 data elements. Each doubleword 403 data element contains 32 bits of information.

Figs. 5a to 5c illustrate the in-register packet data storage representation according to one embodiment of the invention. Unsigned packet byte register representation 510 shows the storage of unsigned packet bytes 401 in one of registers R₀ 212a to R₇ 212h. The information of each byte data element is stored in bit 7 to bit 0 for byte 0, bit 15 to bit 8 for byte 1, bit 23 to bit 16 for byte 2, bit 31 to bit 24 for byte 3, bit 39 to bit 32 for byte 4, bit 47 to bit 40 for byte 5, bit 55 to bit 48 for byte 6, and bit 63 to bit 56 for byte 7. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. Also, with eight data elements accessed, one operation can be performed on eight data elements simultaneously. Signed packet byte register representation 511 illustrates the storage of signed packet bytes 401. Note that only the eighth bit of each byte data element is needed for the sign indication.

Unsigned packed word register representation 512 shows how word 3 through word 0 are stored in one of registers 209. Bits 15 through 0 contain the data element information for word 0, bits 31 through 16 contain the data element information for word 1, bits 47 through 32 contain the data element information for word 2, and bits 63 through 48 contain the data element information for word 3. Signed packed word register representation 513 is similar to unsigned packed word register representation 512. Note that only the 16th bit of each word data element is required for the sign indication.

Unsigned packed doubleword register representation 514 shows how registers 209 store two doubleword data elements. Doubleword 0 is stored in bit 31 through bit 0 of the register. Doubleword 1 is stored in bits 63 through 32 of the register. Signed packed doubleword register representation 515 is similar to unsigned packed doubleword register representation 514. Note that the necessary sign bit is the 32nd bit of each doubleword data element.
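By way of illustration only, the register layouts of Figs. 5a to 5c may be modelled in software. The following Python sketch is not part of the described embodiments; the helper names `unpack` and `pack` and the sample register value are assumptions chosen for the example. It treats a 64-bit integer as one of registers 209 and splits it into data elements, with data element 0 occupying the least significant bits.

```python
MASK64 = (1 << 64) - 1

def unpack(reg, width):
    """Split a 64-bit value into 64 // width data elements, element 0 first."""
    mask = (1 << width) - 1
    return [(reg >> (i * width)) & mask for i in range(64 // width)]

def pack(elems, width):
    """Inverse of unpack: element i is placed at bit i * width and upward."""
    mask = (1 << width) - 1
    reg = 0
    for i, e in enumerate(elems):
        reg |= (e & mask) << (i * width)
    return reg & MASK64

reg = 0x0807060504030201          # assumed sample contents of a register
print(unpack(reg, 8))             # eight byte data elements
print(unpack(reg, 16))            # four word data elements
print(unpack(reg, 32))            # two doubleword data elements
```

The number of elements returned is 64 divided by the element width, matching the rule stated for Fig. 4.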

As described above, registers 209 may be used for both packet data and floating point data. In this embodiment of the invention, the individual programming processor 109 may be required to track whether an addressed register, R₀ 212a for example, stores packet data or floating point data. In an alternative embodiment, processor 109 may track the type of data stored in individual registers of registers 209. This alternative embodiment can then generate an error if, for example, a packet addition operation is attempted on floating point data.

Control signal format

One embodiment of a control signal format used by the processor 109 to manipulate packet data is described below. In one embodiment of the invention, the control signal is represented as 32 bits. Decoder 202 may receive control signal 207 from bus 101. In another embodiment, decoder 202 can also receive such control signals from cache memory 160.

Fig. 6a illustrates a control signal format indicating the use of packet data according to one embodiment of the invention. The operation field OP 601 (bits 31 to 26) provides information about the operation to be performed by the processor 109, such as packet addition, packet subtraction, and the like. SRC1 602 (bits 25 to 20) provides the source register address of a register in registers 209. This source register contains the first packet data, Source1, to be used in the execution of the control signal. Similarly, SRC2 603 (bits 19 to 14) contains the address of a register in registers 209. This second source register contains the packet data, Source2, to be used during execution of the operation. DEST 605 (bits 5 to 0) contains the address of a register in registers 209. This destination register will store the result packet data, Result, of the packet data operation.

The control bits SZ 610 (bits 12 and 13) indicate the length of the data elements in the first and second packet data source registers. If SZ 610 equals 01₂, then the packet data is formatted as packet bytes 401. If SZ 610 equals 10₂, then the packet data is formatted as packet words 402. SZ 610 values of 00₂ and 11₂ are reserved; however, in another embodiment, one of these values may be used to indicate packet doublewords 403.

Control bit T 611 (bit 11) indicates whether the operation is to be performed in saturation mode. If T 611 equals 1, a saturating operation is performed. If T 611 equals 0, a non-saturating operation is performed. Saturating operations are described later.

Control bit S 612 (bit 10) indicates the use of a signed operation. If S 612 equals 1, a signed operation is performed. If S 612 equals 0, an unsigned operation is performed.
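As a non-limiting illustration, the field positions given for the control signal format of Fig. 6a can be extracted with shifts and masks. In the Python sketch below, the function name `decode_control` and the sample field values (including the operation code 5) are hypothetical; only the bit positions are taken from the description above.

```python
def decode_control(word):
    """Extract the Fig. 6a fields from a 32-bit control signal."""
    return {
        "OP":   (word >> 26) & 0x3F,  # bits 31..26, operation
        "SRC1": (word >> 20) & 0x3F,  # bits 25..20, first source register
        "SRC2": (word >> 14) & 0x3F,  # bits 19..14, second source register
        "SZ":   (word >> 12) & 0x3,   # bits 13..12, data element length
        "T":    (word >> 11) & 0x1,   # bit 11, saturating when 1
        "S":    (word >> 10) & 0x1,   # bit 10, signed when 1
        "DEST": word & 0x3F,          # bits 5..0, destination register
    }

# Hypothetical encoding: OP=5, SRC1=1, SRC2=2, SZ=01 (bytes), T=1, S=0, DEST=3
word = (5 << 26) | (1 << 20) | (2 << 14) | (0b01 << 12) | (1 << 11) | (0 << 10) | 3
print(decode_control(word))
```

Bits 9 to 6 are not assigned by the description of Fig. 6a and are therefore ignored by this sketch.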

Fig. 6b illustrates a second control signal format indicating the use of packet data in accordance with one embodiment of the present invention. This format corresponds to the general integer opcode format described in the "Pentium processor family user Manual", available from Intel Corporation, literature distributor (P.O. Box 7641, Mt. Prospect, IL, 60056-). Note that OP 601, SZ 610, T 611, and S 612 are all merged into one large field. For some control signals, bits 3 to 5 are SRC1 602. In one embodiment, when there is a SRC1 602 address, bits 3 through 5 also correspond to DEST 605. In an alternative embodiment, when there is a SRC2 603 address, bits 0 through 2 also correspond to DEST 605. For other control signals, such as packet shift immediate operations, bits 3 through 5 represent an extension to the opcode field. In one embodiment, this extension allows the programmer to include an immediate value, such as a shift count value, with the control signal. In one embodiment, the immediate value follows the control signal. This is described in more detail in the "Pentium processor family user Manual", appendix F, pages F-1 to F-3. Bits 0 to 2 represent SRC2 603. This general format allows register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing. Also, in one embodiment, this general format can support integer register to register and register to integer register addressing.

Description of saturation/unsaturation

As described above, T 611 indicates whether the operation optionally saturates. When saturation is enabled, the result of the operation is clamped when it overflows or underflows the data range. Clamping means that the result is set to the maximum or minimum value if it exceeds the maximum or falls below the minimum value of the range. In the case of underflow, saturation clamps the result to the lowest value in the range, and in the case of overflow, to the highest value. Table 7 shows the allowable range of each data format.

Data format            Minimum value    Maximum value
Unsigned byte          0                255
Signed byte            -128             127
Unsigned word          0                65535
Signed word            -32768           32767
Unsigned doubleword    0                2³²-1
Signed doubleword      -2³¹             2³¹-1

TABLE 7

As described above, T 611 indicates whether a saturating operation is being performed. Thus, with the unsigned byte data format, if the result of the operation is 258 and saturation is enabled, the result is clamped to 255 before being stored into the destination register of the operation. Similarly, if the result of the operation is -32999 and the processor 109 uses the signed word data format with saturation enabled, the result is clamped to -32768 before it is stored into the destination register of the operation.
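The clamping just described may be sketched as follows. This Python fragment is illustrative only; the function names `limits` and `saturate` are assumptions introduced for the example. The range of an element is derived from its bit length and signedness, and the two examples from the preceding paragraph are reproduced.

```python
def limits(width, signed):
    """Return (minimum, maximum) representable in a data element of the given bit length."""
    if signed:
        return -(1 << (width - 1)), (1 << (width - 1)) - 1
    return 0, (1 << width) - 1

def saturate(value, width, signed):
    """Clamp value to the range of the data format instead of truncating it."""
    lo, hi = limits(width, signed)
    return max(lo, min(hi, value))

print(saturate(258, 8, signed=False))     # unsigned byte overflow, clamped to 255
print(saturate(-32999, 16, signed=True))  # signed word underflow, clamped to -32768
```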

Packet addition

Packet addition operation

One embodiment of the present invention is capable of performing a packet addition operation in execution unit 130. That is, the present invention enables the data elements of the first packet data to be added individually to the data elements of the second packet data.

Fig. 7a illustrates a method of performing a packet addition according to one embodiment of the invention. In step 701, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: the operation code for packet addition; the SRC1 602, SRC2 603, and DEST 605 addresses in registers 209; saturating/non-saturating, signed/unsigned, and the length of the data elements in the packet data. At step 702, the decoder 202 accesses, via the internal bus 170, the registers 209 in the register file 150 given by the SRC1 602 and SRC2 603 addresses. Registers 209 provide the packet data Source1 and Source2 stored in the registers at these addresses, respectively, to execution unit 130. That is, registers 209 pass the packet data to execution unit 130 via internal bus 170.

In step 703, the decoder 202 enables the execution unit 130 to perform the packet addition operation. The decoder 202 also communicates, via the internal bus 170, the length of the packet data elements, whether saturation is employed, and whether signed arithmetic is employed. At step 704, the length of the data element determines which step is performed next. If the data element length in the packet data is 8 bits (byte data), the execution unit 130 performs step 705a. However, if the data elements in the packet data are 16 bits (word data) in length, the execution unit 130 performs step 705b. In one embodiment of the invention, only packet additions of 8-bit and 16-bit data element lengths are supported. However, other embodiments can support different and/or other lengths. For example, packet addition of 32-bit data element length can additionally be supported in an alternative embodiment.

Assuming the data element is 8 bits in length, step 705a is performed. Execution unit 130 adds bit 7 to bit 0 of Source1 to bit 7 to bit 0 of Source2, generating bit 7 to bit 0 of Result packet data. In parallel with this addition, execution unit 130 adds bit 15 through bit 8 of Source1 to bit 15 through bit 8 of Source2, generating bit 15 through bit 8 of Result packet data. In parallel with these additions, execution unit 130 adds bits 23 through 16 of Source1 to bits 23 through 16 of Source2, generating bits 23 through 16 of Result packet data. In parallel with these additions, execution unit 130 adds bits 31 through 24 of Source1 to bits 31 through 24 of Source2, generating bits 31 through 24 of Result packet data. In parallel with these additions, execution unit 130 adds bits 39 through 32 of Source1 to bits 39 through 32 of Source2, generating bits 39 through 32 of Result packet data. In parallel with these additions, execution unit 130 adds bits 47 through 40 of Source1 to bits 47 through 40 of Source2, generating bits 47 through 40 of Result packet data. In parallel with these additions, execution unit 130 adds bits 55 through 48 of Source1 to bits 55 through 48 of Source2, generating bits 55 through 48 of Result packet data. In parallel with these additions, execution unit 130 adds bits 63 through 56 of Source1 to bits 63 through 56 of Source2, generating bits 63 through 56 of Result packet data.

Assuming that the data element is 16 bits in length, step 705b is performed. Execution unit 130 adds bit 15 to bit 0 of Source1 to bit 15 to bit 0 of Source2, generating bit 15 to bit 0 of Result packet data. In parallel with this addition, execution unit 130 adds bits 31 through 16 of Source1 to bits 31 through 16 of Source2, generating bits 31 through 16 of Result packet data. In parallel with these additions, execution unit 130 adds bits 47 through 32 of Source1 to bits 47 through 32 of Source2, generating bits 47 through 32 of Result packet data. In parallel with these additions, execution unit 130 adds bits 63 through 48 of Source1 to bits 63 through 48 of Source2, generating bits 63 through 48 of Result packet data.

At step 706, decoder 202 enables the one of registers 209 identified by the DEST 605 address of the destination register. Accordingly, Result is stored in the register addressed by DEST 605.

Table 8a illustrates a register representation of a packet addition operation. The first row of bits is the packet data representation of Source1 packet data. The second row of bits is the packet data representation of Source2 packet data. The third row of bits is the packet data representation of the Result packet data. The number below each data element digit is the data element number. For example, Source1 data element 0 is 10001000₂. Thus, if the data element is 8 bits long (byte data) and an unsigned unsaturated addition is performed, the execution unit 130 generates a Result packet data as shown.

Note that in one embodiment of the invention, when the result overflows or underflows and the operation is non-saturating, the result is simply truncated. That is, the carry bit is ignored. For example, in Table 8a, the register representation of result data element 1 would be: 10001000₂ + 10001000₂ = 00010000₂. Similarly, the result is truncated for underflows. This form of truncation allows programmers to easily perform modulo arithmetic. For example, the formula for result data element 1 may be expressed as: (Source1 data element 1 + Source2 data element 1) mod 256 = result data element 1. Furthermore, those skilled in the art will appreciate from this description that overflows and underflows can be detected by setting error bits in a status register.

TABLE 8a
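For illustration only, the non-saturating packet addition of steps 705a and 705b may be modelled in Python as below. The function name `packed_add` and the operand values are assumptions made for this sketch; the hardware described performs the element additions in parallel, whereas the sketch simply loops over the elements. Each element sum is reduced modulo 2 raised to the element length, so the carry out of an element is discarded.

```python
def packed_add(src1, src2, width):
    """Non-saturating packet addition of two 64-bit values.

    Each data element is added independently; the carry out of an
    element is dropped, which is addition modulo 2**width.
    """
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        a = (src1 >> shift) & mask
        b = (src2 >> shift) & mask
        result |= ((a + b) & mask) << shift
    return result

src = 0x8888888888888888               # every byte is 10001000 in binary
print(hex(packed_add(src, src, 8)))    # byte elements: each becomes 0x10
print(hex(packed_add(src, src, 16)))   # word elements: each becomes 0x1110
```

With byte elements, 10001000₂ + 10001000₂ yields 00010000₂ in every element because the ninth bit of each sum is ignored.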

Table 8b illustrates a register representation of a packed word data add operation. Thus, if the data element is 16 bits long (word data) and an unsigned unsaturated addition is performed, the execution unit 130 generates a Result packet data as shown. Note that in word data element 2, the carry from bit 7 (see bit 1, emphasized below) propagates into bit 8, causing data element 2 to overflow (see "overflow" emphasized below).

TABLE 8b

Table 8c illustrates a register representation of a packed doubleword data addition operation. An alternative embodiment of the present invention supports this operation. Thus, if the data element is 32 bits long (i.e., doubleword data) and an unsigned unsaturated addition is performed, the execution unit 130 generates a Result packet data as shown. Note that the carry from bit 7 and bit 15 of doubleword data element 1 propagates into bit 8 and bit 16, respectively.

TABLE 8c

To better illustrate the difference between packet addition and ordinary addition, the data from the above example is reproduced in Table 9. However, in this case, ordinary addition (64 bits) is performed on the data. Note that the carries from bit 7, bit 15, bit 23, bit 31, bit 39, and bit 47 have been carried into bit 8, bit 16, bit 24, bit 32, bit 40, and bit 48, respectively.

TABLE 9
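The contrast between ordinary and packet addition can also be shown with a small Python sketch. The operand values below are assumed for the example and are not the values of Table 9; the function names are likewise hypothetical. The same two 64-bit operands are added once as a single 64-bit number and once as eight independent bytes.

```python
MASK64 = (1 << 64) - 1

def ordinary_add(a, b):
    """Plain 64-bit addition: carries ripple across every byte boundary."""
    return (a + b) & MASK64

def packed_byte_add(a, b):
    """Packet byte addition: the carry out of bit 7, 15, 23, ... is dropped."""
    out = 0
    for shift in range(0, 64, 8):
        byte_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        out |= (byte_sum & 0xFF) << shift
    return out

a = 0x00FF00FF00FF00FF   # assumed operands
b = 0x0001000100010001
print(hex(ordinary_add(a, b)))      # carries reach bits 8, 24, 40 and 56
print(hex(packed_byte_add(a, b)))   # every byte wraps to 0
```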

Signed/unsaturated packet addition

Table 10 illustrates an example of signed packet addition where the data elements of the packet data are 8 bits in length. Saturation is not used. Therefore, the result can overflow and underflow. Table 10 uses data different from those in tables 8a to 8c and Table 9.

TABLE 10

Signed/saturated packet addition

Table 11 illustrates an example of signed packet addition where the data elements of the packet data are 8 bits in length. Saturation is employed, thus clamping overflow to a maximum value and underflow to a minimum value. Table 11 uses the same data as table 10. Here data elements 0 and 2 are clamped to a minimum value and data elements 4 and 6 are clamped to a maximum value.

TABLE 11
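A signed saturating packet byte addition of the kind summarized in Table 11 may be sketched as follows. The sketch is illustrative; the function names and the operand values are assumptions and do not reproduce the data of Tables 10 and 11. Each byte is interpreted as a two's complement value, the sum is clamped to the range -128 to 127, and the clamped value is written back into the element.

```python
def to_signed(x, width):
    """Interpret a width-bit pattern as a two's complement value."""
    return x - (1 << width) if x & (1 << (width - 1)) else x

def packed_add_signed_sat(src1, src2, width=8):
    """Signed packet addition with saturation on 64-bit values."""
    mask = (1 << width) - 1
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    result = 0
    for shift in range(0, 64, width):
        a = to_signed((src1 >> shift) & mask, width)
        b = to_signed((src2 >> shift) & mask, width)
        s = max(lo, min(hi, a + b))        # clamp instead of wrapping
        result |= (s & mask) << shift
    return result

# Assumed operands: element 0 overflows (127 + 1), element 1 underflows (-128 + -1)
src1 = 0x000000000001807F
src2 = 0x000000000001FF01
print(hex(packed_add_signed_sat(src1, src2)))
```

In this example element 0 is clamped to the maximum value 0x7F, element 1 is clamped to the minimum value 0x80, and element 2 holds the exact sum.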

Packet subtraction

Packet subtraction operation

One embodiment of the present invention enables the execution of a packet subtraction operation in execution unit 130. That is, the present invention enables each data element of the second packet data to be separately subtracted from each data element of the first packet data.

Fig. 7b illustrates a method of performing a packet subtraction according to an embodiment of the invention. Note that steps 710-713 are similar to steps 701-704.

In the current embodiment of the invention, only packet subtractions of 8-bit and 16-bit data element lengths are supported. However, alternative embodiments can support different and/or other lengths. For example, an alternative embodiment can additionally support packet subtraction of 32-bit data element length.

Assuming the data element is 8 bits in length, steps 714a and 715a are performed. At step 714a, execution unit 130 takes the 2's complement of bits 7 through 0 of Source2. In parallel with this 2's complement, execution unit 130 takes the 2's complement of bits 15 through 8 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 23 through 16 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 31 through 24 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 39 through 32 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 47 through 40 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 55 through 48 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 63 through 56 of Source2. At step 715a, execution unit 130 adds the 2's complemented bits of Source2 to the bits of Source1, as generally described for step 705a.

Assuming the data element is 16 bits in length, steps 714b and 715b are performed. At step 714b, execution unit 130 takes the 2's complement of bits 15 through 0 of Source2. In parallel with this 2's complement, execution unit 130 takes the 2's complement of bits 31 through 16 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 47 through 32 of Source2. In parallel with these 2's complements, execution unit 130 takes the 2's complement of bits 63 through 48 of Source2. At step 715b, execution unit 130 adds the 2's complemented bits of Source2 to the bits of Source1, as generally described for step 705b.

Note that steps 714 and 715 are methods used in one embodiment of the present invention to subtract the first number from the second number. However, other forms of subtraction are known in the art and the present invention should not be considered limited to the use of a 2's complement arithmetic operation.
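Steps 714 and 715 may be sketched, purely by way of example, in Python. The function name `packed_sub` and the operand values are hypothetical. For each data element the sketch first forms the 2's complement of the Source2 element and then adds it to the Source1 element, discarding the carry out of the element.

```python
def packed_sub(src1, src2, width):
    """Non-saturating packet subtraction (Source1 - Source2) on 64-bit values."""
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        a = (src1 >> shift) & mask
        b = (src2 >> shift) & mask
        neg_b = (~b + 1) & mask                  # step 714: invert and add 1
        result |= ((a + neg_b) & mask) << shift  # step 715: add to Source1 element
    return result

src1, src2 = 0x0305, 0x0106                # assumed operands
print(hex(packed_sub(src1, src2, 8)))      # bytes: 0x05 - 0x06 wraps, 0x03 - 0x01 = 0x02
print(hex(packed_sub(src1, src2, 16)))     # word: 0x0305 - 0x0106 = 0x01ff
```

With byte elements the borrow from element 0 does not reach element 1, whereas with word elements the same bits form a single element and the borrow is applied.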

At step 716, decoder 202 enables the one of registers 209 identified by the destination address of the destination register. The resulting packet data is thus stored in the DEST register of registers 209.

Table 12 illustrates a register representation of a packet subtraction operation. Assuming that the data elements are 8 bits long (byte data) and that unsigned unsaturated subtraction is performed, execution unit 130 generates the resulting packet data as shown.

TABLE 12

Packet data addition/subtraction circuit

Fig. 8 illustrates a circuit for performing packet addition and packet subtraction on individual bits of packet data according to one embodiment of the invention. Fig. 8 shows a modified bit-slice adder/subtractor 800. Adder/subtractors 801a-b can add two bits of Source2 to Source1 or subtract them from Source1. Operation and carry control 803 transmits a control signal on control line 809a to enable an addition or subtraction operation. Thus, adder/subtractor 801a adds or subtracts bit i received on Source2ᵢ 805a to or from bit i received on Source1ᵢ 804a, producing a result bit transmitted on Resultᵢ 806a. Cin 807a-b and Cout 808a-b represent the carry control circuitry commonly found on adders/subtractors.

Bit control 802 is enabled from operation and carry control 803 via packet data enable 811 to control Cinᵢ₊₁ 807b and Coutᵢ. For example, in Table 13a, an unsigned packet byte addition is performed. If adder/subtractor 801a adds Source1 bit 7 and Source2 bit 7, then operation and carry control 803 enables bit control 802, stopping the carry from propagating from bit 7 to bit 8.

TABLE 13a

However, if an unsigned packet word addition is performed and bit 7 of Source1 is similarly added to bit 7 of Source2 by adder/subtractor 801a, bit control 802 propagates the carry to bit 8. Table 13b illustrates this result. This propagation is allowed for packet doubleword addition as well as for non-packet addition.

TABLE 13b

Adder/subtractor 801a subtracts bit Source2ᵢ 805a from Source1ᵢ 804a by first forming the 2's complement of Source2ᵢ 805a, that is, by inverting Source2ᵢ 805a and adding 1. Adder/subtractor 801a then adds this result to Source1ᵢ 804a. Bit-slice 2's complement techniques are well known in the art, and those skilled in the art will understand how to design such a bit-slice 2's complement circuit. Note that the propagation of the carry is controlled by bit control 802 and operation and carry control 803.
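The carry control of Fig. 8 can be imitated in software as a ripple-carry addition over 64 one-bit slices. The Python sketch below is an assumption-laden model, not the circuit itself: the function name `bit_slice_add` is hypothetical, and the role of bit control 802 is reduced to forcing the carry into bit i to zero whenever bit i is the lowest bit of a data element.

```python
def bit_slice_add(src1, src2, width):
    """Add two 64-bit values one bit slice at a time.

    The carry from slice i-1 into slice i is blocked whenever i is a
    multiple of the data element length (an element boundary).
    """
    result, carry = 0, 0
    for i in range(64):
        if i % width == 0:
            carry = 0                          # carry propagation stopped here
        a = (src1 >> i) & 1
        b = (src2 >> i) & 1
        s = a ^ b ^ carry                      # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))    # carry out of a full adder
        result |= s << i
    return result

print(hex(bit_slice_add(0x80, 0x80, 8)))   # byte data: carry out of bit 7 is stopped
print(hex(bit_slice_add(0x80, 0x80, 16)))  # word data: carry propagates into bit 8
```

A length of 64 in this model corresponds to non-packet addition, in which the carry is never blocked inside the register.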

FIG. 9 illustrates a circuit for performing packet addition and packet subtraction on packet byte data in accordance with one embodiment of the present invention. Source1 bus 901 and Source2 bus 902 carry the information signals to adder/subtractors 908a-h through Source1ᵢₙ 906a-h and Source2ᵢₙ 905a-h, respectively. Thus, adder/subtractor 908a adds Source2 bit 7 to bit 0 to, or subtracts them from, Source1 bit 7 to bit 0; adder/subtractor 908b does the same for bits 15 to 8 of Source2 and Source1, and so on. CTRL 904a-h receive, from operation control 903 via packet control 911, control signals that disable carry propagation, enable/disable saturation, and enable/disable signed/unsigned arithmetic. Operation control 903 disables carry propagation by receiving carry information from CTRL 904a-h and not propagating it to the next most significant adder/subtractor 908a-h. Thus, operation control 903 performs the operations of operation and carry control 803 and bit control 802 for the 64-bit packet data. Given the examples of Figs. 1-9 and the above description, one skilled in the art can build such a circuit.

Adder/subtractors 908a-h pass the result information of the various packet additions to result registers 910a-h via result outputs 907a-h. Each result register 910a-h stores and subsequently transmits the result information onto Result bus 909. This result information is then stored in the integer register specified by the DEST 605 register address.

Fig. 10 is a logical view of a circuit that performs packet addition and packet subtraction on packet word data according to one embodiment of the invention. Here, a packet word operation is being performed. The operation control 903 enables carry propagation between bits 8 and 7, 24 and 23, 40 and 39, and 56 and 55. Thus, adder/subtractors 908a and 908b, shown as virtual adder/subtractor 1008a, work together to add/subtract the first word (bit 15 to bit 0) of packet word data Source2 to/from the first word (bit 15 to bit 0) of packet word data Source1; adder/subtractors 908c and 908d, shown as virtual adder/subtractor 1008b, work together to add/subtract the second word (bits 31 to 16) of packet word data Source2 to/from the second word (bits 31 to 16) of packet word data Source1, and so on.

Virtual adder/subtractors 1008a-d pass result information to virtual result registers 1010a-d through result outputs 1007a-d (combined result outputs 907a-b, 907c-d, 907e-f, and 907g-h). Each virtual result register 1010a-d (combined result registers 910a-b, 910c-d, 910e-f, and 910g-h) stores a 16-bit result data element to be passed onto the Result bus 909.

FIG. 11 is a logical view of a circuit that performs packet addition and packet subtraction on packet doubleword data in accordance with one embodiment of the present invention. The operation control 903 enables carry propagation between bits 8 and 7, 16 and 15, 24 and 23, 40 and 39, 48 and 47, and 56 and 55. Thus, adder/subtractors 908a-d, shown as virtual adder/subtractor 1108a, work together to add/subtract the first doubleword (bit 31 to bit 0) of packet doubleword data Source2 to/from the first doubleword (bit 31 to bit 0) of packet doubleword data Source1; adder/subtractors 908e-h, shown as virtual adder/subtractor 1108b, work together to add/subtract the second doubleword (bits 63 to 32) of packet doubleword data Source2 to/from the second doubleword (bits 63 to 32) of packet doubleword data Source1.

Virtual adder/subtractors 1108a-b pass result information to virtual result registers 1110a-b via result outputs 1107a-b (combined result outputs 907a-d and 907e-h). Each virtual result register 1110a-b (combined result registers 910a-d and 910e-h) stores a 32-bit result data element to be passed onto the Result bus 909.
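The way eight byte-wide adder/subtractors cooperate as virtual word or doubleword adders in Figs. 9 to 11 may be modelled, under simplifying assumptions, as follows. The function name `add_with_byte_adders` is hypothetical and only addition is modelled. Adder k handles bits 8k+7 to 8k, and the carry from adder k-1 is passed on only when the boundary between them lies inside a data element.

```python
def add_with_byte_adders(src1, src2, width):
    """Model of eight 8-bit adders whose carry links are enabled selectively.

    width is the data element length (8, 16, 32 or 64). The carry into
    adder k is disabled when bit 8*k is the lowest bit of a data element.
    """
    result, carry = 0, 0
    for k in range(8):
        if (8 * k) % width == 0:
            carry = 0                          # carry propagation disabled
        a = (src1 >> (8 * k)) & 0xFF
        b = (src2 >> (8 * k)) & 0xFF
        total = a + b + carry
        result |= (total & 0xFF) << (8 * k)    # output of this byte adder
        carry = total >> 8                     # carry offered to the next adder
    return result

print(hex(add_with_byte_adders(0x00FF00FF, 0x00010001, 8)))    # eight separate adders
print(hex(add_with_byte_adders(0x00FF00FF, 0x00010001, 16)))   # four virtual word adders
print(hex(add_with_byte_adders(0x1FFFFFFFF, 0x1, 32)))         # two virtual doubleword adders
```

For doubleword data the carry between bits 31 and 32 remains disabled, consistent with the list of enabled boundaries given for Fig. 11.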

Packet multiplication

Packet multiplication operation

In one embodiment of the present invention, the SRC1 register contains the multiplicand data (Source1), the SRC2 register contains the multiplier data (Source2), and the DEST register contains a portion of the product (Result). That is, each data element of Source1 is independently multiplied by the corresponding data element of Source2. Depending on the type of multiplication, either the high order bits or the low order bits of the product will be contained in Result.

In one embodiment of the invention, the following multiplication operations are supported: packet multiply high unsigned, packet multiply high signed, and packet multiply low. High/low indicates which bits of the product are to be included in Result. This is necessary because the multiplication of two N-bit numbers yields a product with 2N bits. Since each result data element is the same size as the multiplicand and multiplier data elements, the result can only represent half of the product. High causes the higher order bits to be output as the result. Low causes the lower order bits to be output as the result. For example, packet multiply high unsigned of Source1[7:0] × Source2[7:0] stores the high order bits of the product in Result[7:0].

In one embodiment of the invention, the use of the high/low operation modifier eliminates the possibility of overflow from one data element into the next higher data element. That is, this modifier allows the programmer to select which bits of the product are to be in the result without regard to overflow. The programmer can generate the complete 2N-bit product with a combination of packet multiplication operations. For example, a programmer can perform a multiply high unsigned packet operation on Source1 and Source2 and then a multiply low packet operation on the same Source1 and Source2 to obtain the complete 2N-bit products. The multiply high operation is provided because the high order bits of the product are often the only significant part of the product. The programmer can obtain the high order bits of the product without first performing any truncation, as is typically required for non-packet data operations.
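As a hedged illustration of how the high and low halves relate to the complete product, the following Python sketch works on a single pair of 16-bit unsigned data elements. The function names `mul_high_unsigned` and `mul_low` and the operand values are assumptions made for the example.

```python
def mul_high_unsigned(a, b, width=16):
    """High order half of the 2N-bit product of two unsigned N-bit elements."""
    return ((a * b) >> width) & ((1 << width) - 1)

def mul_low(a, b, width=16):
    """Low order half of the 2N-bit product of two N-bit elements."""
    return (a * b) & ((1 << width) - 1)

a, b = 0x1234, 0x5678                  # assumed 16-bit operands
high = mul_high_unsigned(a, b)
low = mul_low(a, b)
full = (high << 16) | low              # recombine the two halves
print(hex(high), hex(low), hex(full))
print(full == a * b)                   # the halves together hold the whole product
```

Neither half can overflow its data element, since each is exactly N bits of the 2N-bit product.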

In one embodiment of the invention, each data element in Source2 may have a different value. This gives the programmer the flexibility to use a different multiplier value for each multiplicand in Source1.

In step 1201, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: the operation code for the appropriate multiplication operation; the SRC1 602, SRC2 603, and DEST 605 addresses in registers 209; signed/unsigned, high/low, and the length of the data elements in the packet data.

At step 1202, the decoder 202 accesses, via the internal bus 170, the registers 209 in the register file 150 given by the SRC1 602 and SRC2 603 addresses. Registers 209 provide execution unit 130 with the packet data stored in the SRC1 602 register (Source1) and the packet data stored in the SRC2 603 register (Source2). That is, registers 209 pass the packet data to execution unit 130 via internal bus 170.

At step 1130, decoder 202 enables execution unit 130 to perform the appropriate packet multiply operation. The decoder 202 also passes the length of the data elements and the high/low selection for the multiply operation over the internal bus 170.

At step 1210, the length of the data element determines which step is performed next. If the length of the data element is 8 bits (byte data), the execution unit 130 performs step 1212. However, if the data elements in the packet data are 16 bits in length (word data), execution unit 130 performs step 1214. In one embodiment, only packet multiplication of 16-bit data element length is supported. In another embodiment, packet multiplication of 8-bit and 16-bit data element lengths is supported. In yet another embodiment, packet multiplication of 32-bit data element length is also supported.

Assuming the data element is 8 bits in length, step 1212 is performed. In step 1212, the following operations are performed. Multiplying Source1 bits 7 through 0 by Source2 bits 7 through 0 generates Result bits 7 through 0. Multiplying Source1 bits 15 through 8 by Source2 bits 15 through 8 generates Result bits 15 through 8. Multiplying Source1 bits 23 through 16 by Source2 bits 23 through 16 generates Result bits 23 through 16. Multiplying Source1 bits 31 through 24 by Source2 bits 31 through 24 generates Result bits 31 through 24. Multiplying Source1 bits 39 through 32 by Source2 bits 39 through 32 generates Result bits 39 through 32. Multiplying Source1 bits 47 through 40 by Source2 bits 47 through 40 generates Result bits 47 through 40. Multiplying Source1 bits 55 through 48 by Source2 bits 55 through 48 generates Result bits 55 through 48. Multiplying Source1 bits 63 through 56 by Source2 bits 63 through 56 generates Result bits 63 through 56.

Assuming the data element is 16 bits in length, step 1214 is performed. In step 1214, the following operations are performed. Multiplying Source1 bits 15 through 0 by Source2 bits 15 through 0 generates Result bits 15 through 0. Multiplying the Source1 bits 31 through 16 by the Source2 bits 31 through 16 generates Result bits 31 through 16. Multiplying Source1 bits 47 through 32 by Source2 bits 47 through 32 generates Result bits 47 through 32. Multiplying the Source1 bits 63 through 48 by the Source2 bits 63 through 48 generates Result bits 63 through 48.

In one embodiment, the multiplication of step 1212 is performed simultaneously. In yet another embodiment, these multiplications are performed serially. In another embodiment, some of these multiplications are performed simultaneously and some are performed serially. This discussion applies equally to the multiplication of step 1214.

The Result is stored in the DEST register at step 1220.

Table 14 shows a register representation of a packed multiply unsigned high operation on packed word data. The first row of bits is the packet data representation of Source1. The second row of bits is the packet data representation of Source2. The third row of bits is the packet data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 2 is 1111111100000000₂.

TABLE 14

Table 15 shows a register representation of a packed multiply signed high operation on packed word data.

TABLE 15

Table 16 shows a register representation of a packed multiply low operation on packed word data.

TABLE 16

Packed data multiplication circuit

In one embodiment, the multiplication operation can be performed on multiple data elements in the same number of clock cycles as a single multiplication operation on unpacked data. To achieve execution in the same number of clock cycles, parallelism is employed. That is, registers are simultaneously instructed to perform the multiplication operation on the data elements. This is discussed in more detail below.

FIG. 13 illustrates a circuit for performing packed multiplication according to one embodiment of the invention. Operation control 1300 controls the circuits that perform the multiplication. Operation control 1300 processes the control signals for the multiplication and has the following outputs: high/low enable 1380, byte/word enable 1381, and sign enable 1382. High/low enable 1380 identifies whether the result includes the high order bits or the low order bits of the product. Byte/word enable 1381 identifies whether a packed byte or a packed word multiplication operation is to be performed. Sign enable 1382 indicates whether signed multiplication should be employed.

The packed word multiplier 1301 multiplies four word data elements simultaneously. The packed byte multiplier 1302 multiplies 8 byte data elements. Both the packed word multiplier 1301 and the packed byte multiplier 1302 have the following inputs: Source1[63:0] 1331, Source2[63:0] 1333, sign enable 1382, and high/low enable 1380.

The packed word multiplier 1301 comprises four 16 × 16 multiplier circuits: 16 × 16 multiplier A 1310, 16 × 16 multiplier B 1311, 16 × 16 multiplier C 1312, and 16 × 16 multiplier D 1313. 16 × 16 multiplier A 1310 has inputs Source1[15:0] and Source2[15:0]. 16 × 16 multiplier B 1311 has inputs Source1[31:16] and Source2[31:16]. 16 × 16 multiplier C 1312 has inputs Source1[47:32] and Source2[47:32]. 16 × 16 multiplier D 1313 has inputs Source1[63:48] and Source2[63:48]. Each 16 × 16 multiplier is coupled to sign enable 1382. Each 16 × 16 multiplier produces a 32-bit product. For each multiplier, a multiplexer (Mx0 1350, Mx1 1351, Mx2 1352, and Mx3 1353, respectively) receives the 32-bit result. Depending on the value of high/low enable 1380, each multiplexer outputs either the 16 high order bits or the 16 low order bits of the product. The outputs of the four multiplexers are combined into one 64-bit result. This result is optionally stored in result register 1 1371.

The packed byte multiplier 1302 includes eight 8 × 8 multiplier circuits: 8 × 8 multiplier A 1320 through 8 × 8 multiplier H 1327. Each 8 × 8 multiplier has an 8-bit input from each of Source1[63:0] 1331 and Source2[63:0] 1333. For example, 8 × 8 multiplier A 1320 has inputs Source1[7:0] and Source2[7:0], while 8 × 8 multiplier H 1327 has inputs Source1[63:56] and Source2[63:56]. Each 8 × 8 multiplier is coupled to sign enable 1382. Each 8 × 8 multiplier produces a 16-bit product. For each multiplier, a multiplexer (such as Mx4 1360 and Mx11 1367) receives the 16-bit result. Depending on the value of high/low enable 1380, each multiplexer outputs either the 8 high order bits or the 8 low order bits of the product. The outputs of the 8 multiplexers are combined into one 64-bit result. This result is optionally stored in result register 2 1372. Byte/word enable 1381 enables a particular result register depending on the length of the data element required by the operation.

In one embodiment, the area needed to implement the multipliers is reduced by fabricating a circuit that can multiply either two 8 × 8 numbers or one 16 × 16 number. That is, two 8 × 8 multipliers and one 16 × 16 multiplier are combined into a single 8 × 8 and 16 × 16 multiplier. Operation control 1300 enables the appropriate size for the multiplication. In this embodiment, the physical area used by the multipliers is reduced; however, it would be difficult to perform both a packed byte multiplication and a packed word multiplication. In another embodiment supporting packed doubleword multiplication, one multiplier can perform four 8 × 8 multiplications, two 16 × 16 multiplications, or one 32 × 32 multiplication.

In one embodiment, only packed word multiplication operations are provided. In this embodiment, the packed byte multiplier 1302 and result register 2 1372 are not included.

The advantage of including the above-described block multiply operation in the instruction set

The above-described packed multiply instruction thus provides an independent multiplication of each data element in Source1 by its corresponding data element in Source 2. Of course, an algorithm requiring that elements of Source1 be multiplied by the same number may be performed by storing the same number in elements of Source 2. Furthermore, this multiply instruction prevents overflow by breaking the carry chain; thereby relieving the programmer of this responsibility, eliminating the need to prepare data to prevent instructions from overflowing, and resulting in more robust code.

In contrast, prior art general purpose processors that do not support this instruction need to perform this operation by unpacking the data elements, performing the multiplications, and then packing the results for further packed processing. Thus, the processor 109 can multiply different data elements of the packed data by different multipliers in parallel using one instruction.

Typical multimedia algorithms perform a large number of multiplications. Thus, the performance of these multimedia algorithms can be improved by reducing the number of instructions required to perform these multiplication operations. Thus, by providing this multiply instruction in the instruction set supported by the processor 109, the processor 109 is able to execute algorithms requiring this functionality at a higher performance level.

Multiply-add/subtract

Multiply-add/subtract operation

In one embodiment, two multiply-add operations are performed using a single multiply-add instruction as shown in tables 17a and 17b below. Table 17a shows a simplified representation of the disclosed multiply-add instruction, while table 17b shows a bit-level example of the disclosed multiply-add instruction.

TABLE 17a

TABLE 17b

The multiply-subtract operation is the same as the multiply-add operation, except that the "add" is replaced with a "subtract". The operation of an example multiply-subtract instruction that performs two multiply-subtract operations is shown in table 12.

TABLE 12

In one embodiment of the invention, the SRC1 register contains packed data (Source1), the SRC2 register contains packed data (Source2), and the DEST register will contain the result (Result) of executing a multiply-add or multiply-subtract instruction on Source1 and Source2. In the first step of the multiply-add or multiply-subtract instruction, each data element of Source1 is independently multiplied by the corresponding data element of Source2 to generate a set of corresponding intermediate results. When the multiply-add instruction is executed, these intermediate results are added in pairs, generating two data elements, which are stored as the data elements of Result. Conversely, when a multiply-subtract instruction is executed, these intermediate results are subtracted in pairs, generating two data elements, which are stored as the data elements of Result.

Alternate embodiments may change the number of bits in the data elements of the intermediate Result and/or the data elements in the Result. Furthermore, alternative embodiments may change the number of data elements in Source1, Source2, and Result. For example, if Source1 and Source2 each have 8 data elements, a multiply-add/subtract instruction may be implemented to produce a Result with 4 data elements (each data element in the Result represents an addition of two intermediate results), two data elements (each data element in the Result represents an addition of four intermediate results), and so on.

Fig. 14 is a flow diagram illustrating a method of performing multiply-add and multiply-subtract on packet data according to one embodiment of the invention.

In step 1401, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: the operation code of a multiply-add or multiply-subtract instruction.

At step 1402, the decoder 202 accesses the register 209 in the register file 150 via the internal bus 170, given the SRC1 602 and SRC2 603 addresses. Register 209 provides the execution unit 130 with the packed data stored in the SRC1 602 register (Source1) and the packed data stored in the SRC2 603 register (Source2). That is, registers 209 pass the packed data to execution unit 130 via internal bus 170.

At step 1403, the decoder 202 enables the execution unit 130 to execute the instruction. If the instruction is a multiply-add instruction, flow proceeds to step 1414. If, however, the instruction is a multiply-subtract instruction, flow proceeds to step 1415.

In step 1414, the following operations are performed. Multiplying Source1 bits 15-0 by Source2 bits 15-0 generates a first 32-bit intermediate result (intermediate result 1). Multiplying Source1 bits 31-16 by Source2 bits 31-16 generates a second 32-bit intermediate result (intermediate result 2). Multiplying Source1 bits 47-32 by Source2 bits 47-32 generates a third 32-bit intermediate result (intermediate result 3). Multiplying Source1 bits 63-48 by Source2 bits 63-48 generates a fourth 32-bit intermediate result (intermediate result 4). Adding intermediate result 1 to intermediate result 2 generates bits 31 through 0 of Result, and adding intermediate result 3 to intermediate result 4 generates bits 63 through 32 of Result.

Step 1415 is the same as step 1414, except that intermediate result 1 is subtracted from intermediate result 2 to generate bits 31 through 0 of Result, and intermediate result 3 is subtracted from intermediate result 4 to generate bits 63 through 32 of Result.

Different embodiments may perform the multiplications and the additions/subtractions serially, in parallel, or in some combination of serial and parallel operations.

At step 1420, Result is stored in the DEST register.
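Steps 1414 and 1415 can be sketched as follows. This is an illustrative model only: the function name `packed_multiply_add` and its `subtract` flag are hypothetical, the 16-bit elements are assumed to be signed (the text above does not say), and the order of the subtraction follows the wording of step 1415 literally.

```python
def to_signed(value, bits):
    """Interpret a `bits`-wide unsigned field as a two's-complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def packed_multiply_add(source1, source2, subtract=False):
    """Model a multiply-add (or multiply-subtract) on four packed words.

    Assumption: elements are treated as signed 16-bit values and each
    32-bit Result element wraps around rather than saturating.
    """
    mask32 = 0xFFFFFFFF
    # Four 32-bit intermediate results, one per pair of word elements.
    ir = []
    for pos in (0, 16, 32, 48):
        a = to_signed((source1 >> pos) & 0xFFFF, 16)
        b = to_signed((source2 >> pos) & 0xFFFF, 16)
        ir.append(a * b)
    if subtract:
        # As worded in step 1415: intermediate result 1 is subtracted
        # from intermediate result 2, and 3 is subtracted from 4.
        low, high = ir[1] - ir[0], ir[3] - ir[2]
    else:
        low, high = ir[0] + ir[1], ir[2] + ir[3]
    return ((high & mask32) << 32) | (low & mask32)


# Word elements (element 0 first): Source1 = 1, 2, 3, 4; Source2 = 5, 6, 7, 8.
added = packed_multiply_add(0x0004000300020001, 0x0008000700060005)
subtracted = packed_multiply_add(0x0004000300020001, 0x0008000700060005,
                                 subtract=True)
```

With these inputs the intermediate results are 5, 12, 21, and 32, so the multiply-add yields the doublewords 17 and 53.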

Packed data multiply-add/subtract circuit

In one embodiment, each multiply-add and multiply-subtract instruction can operate on multiple data elements in the same number of clock cycles as a single multiplication on unpacked data. To achieve execution in the same number of clock cycles, parallelism is employed. That is, registers are simultaneously instructed to perform the multiply-add or multiply-subtract operations on the data elements. This is discussed in more detail below.

Fig. 15 illustrates a circuit for performing multiply-add and/or multiply-subtract operations on packed data in accordance with one embodiment of the present invention. Operation control 1500 processes the control signals for the multiply-add and multiply-subtract instructions. Operation control 1500 outputs a signal on enable 1580 to control packed multiply-adder/subtractor 1501.

The packed multiply-adder/subtractor 1501 has the following inputs: Source1[63:0] 1531, Source2[63:0] 1533, and enable 1580. The packed multiply-adder/subtractor 1501 includes four 16 × 16 multiplier circuits: 16 × 16 multiplier A 1510, 16 × 16 multiplier B 1511, 16 × 16 multiplier C 1512, and 16 × 16 multiplier D 1513. 16 × 16 multiplier A 1510 has inputs Source1[15:0] and Source2[15:0]. 16 × 16 multiplier B 1511 has inputs Source1[31:16] and Source2[31:16]. 16 × 16 multiplier C 1512 has inputs Source1[47:32] and Source2[47:32]. 16 × 16 multiplier D 1513 has inputs Source1[63:48] and Source2[63:48]. The 32-bit intermediate results generated by 16 × 16 multiplier A 1510 and 16 × 16 multiplier B 1511 are received by virtual adder/subtractor 1550, while the 32-bit intermediate results generated by 16 × 16 multiplier C 1512 and 16 × 16 multiplier D 1513 are received by virtual adder/subtractor 1551.

Virtual adder/subtractors 1550 and 1551 either add or subtract their respective 32-bit inputs based on whether the current instruction is a multiply-add or multiply-subtract instruction. The output of virtual adder/subtractor 1550 (i.e., bits 31-0 of Result) and the output of virtual adder/subtractor 1551 (i.e., bits 63-32 of Result) are combined into a 64-bit Result and passed to Result register 1571.

In one embodiment, virtual adder/subtractors 1551 and 1550 are implemented in a similar manner to virtual adder/subtractors 1108b and 1108a (i.e., each of virtual adder/subtractors 1551 and 1550 is composed of four 8-bit adders with appropriate propagation delays). However, alternate embodiments can implement virtual adder/subtractors 1551 and 1550 in various ways.

To execute the equivalent of these multiply-add or multiply-subtract instructions on a prior art processor operating on unpacked data would require four separate 64-bit multiply operations and two 64-bit add or subtract operations, as well as the necessary load and store operations. This wastes the data lines and circuitry used for the bits above bit 16 of Source1 and Source2 and above bit 32 of Result. Also, the entire 64-bit result generated by such a prior art processor may be of no use to the programmer, who would therefore have to truncate each result.

The advantage of including the multiply-add operation described above in the instruction set

The multiply-add/subtract instruction described above may be used for several purposes. For example, a multiply-add instruction may be used for complex multiplication and accumulation of values. Several algorithms using multiply-add instructions will be described later.

Thus, by incorporating the multiply-add and/or multiply-subtract instructions described above into the instruction set supported by processor 109, many functions can be performed with fewer instructions than in prior art general purpose processors that lack such instructions.

Packed shift

Packed shift operation

In one embodiment of the invention, the SRC1 register contains the data to be shifted (Source1), the SRC2 register contains the data representing the shift count (Source2), and the DEST register will contain the result of the shift (Result). That is, each data element in Source1 is independently shifted by the shift count. In one embodiment, Source2 is interpreted as an unsigned 64-bit scalar. In another embodiment, Source2 is packed data and contains a shift count for each corresponding data element in Source1.

In one embodiment of the present invention, both arithmetic and logical shifts are supported. The arithmetic shift shifts the bits of each data element down by a specified number and fills the high order bits of each data element with the initial value of the sign bit. A shift count greater than 7 for packed byte data, greater than 15 for packed word data, or greater than 31 for packed doubleword data results in each Result data element being filled with the initial value of the sign bit. The logical shift can shift up or down. In the logical shift right, the high order bits of each data element are filled with 0s. In the logical shift left, the low order bits of each data element are filled with 0s.

In one embodiment of the present invention, arithmetic shift right, logical shift right, and logical shift left are supported for packed bytes and packed words. In another embodiment of the invention, these operations are also supported for packed doublewords.

In step 1601, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: an opcode for the appropriate shift operation; SRC1 602, SRC2 603, and DEST 605 addresses in register 209; saturation/non-saturation (not necessarily required for shift operations), signed/unsigned (also not necessarily required), and length of the data elements in the packed data.

In step 1602, the decoder 202 accesses the register 209 in the register file 150 via the internal bus 170, given the addresses of SRC1 602 and SRC2 603. Register 209 provides the execution unit 130 with the packed data stored in the SRC1 602 register (Source1) and the scalar shift count stored in the SRC2 603 register (Source2). That is, registers 209 pass the packed data to execution unit 130 via internal bus 170.

At step 1603, the decoder 202 enables the execution unit 130 to perform the appropriate packet shifting operation. The decoder 202 also passes the data element length, the shift operation type, and the shift direction (for logical shifts) over the internal bus 170.

At step 1610, the length of the data element determines which step is to be performed next. If the data element is 8 bits (byte data) in length, the execution unit 130 performs step 1612. However, if the length of the data element in the packed data is 16 bits (word data), the execution unit 130 performs step 1614. In one embodiment, only packed shifts of 8-bit and 16-bit data element lengths are supported. However, in another embodiment, a packed shift of 32-bit data element length is also supported.

Assuming the data element is 8 bits in length, step 1612 is performed. In step 1612, the following operations are performed. Shifting Source1 bits 7 through 0 by the shift count (Source2 bits 63 through 0) generates Result bits 7 through 0. Shifting Source1 bits 15 through 8 by the shift count generates Result bits 15 through 8. Shifting Source1 bits 23 through 16 by the shift count generates Result bits 23 through 16. Shifting Source1 bits 31 through 24 by the shift count generates Result bits 31 through 24. Shifting Source1 bits 39 through 32 by the shift count generates Result bits 39 through 32. Shifting Source1 bits 47 through 40 by the shift count generates Result bits 47 through 40. Shifting Source1 bits 55 through 48 by the shift count generates Result bits 55 through 48. Shifting Source1 bits 63 through 56 by the shift count generates Result bits 63 through 56.

Assuming the data element is 16 bits in length, step 1614 is performed. The following operations are performed in step 1614. Shifting Source1 bits 15 through 0 by the shift count generates Result bits 15 through 0. Shifting Source1 bits 31 through 16 by the shift count generates Result bits 31 through 16. Shifting Source1 bits 47 through 32 by the shift count generates Result bits 47 through 32. Shifting Source1 bits 63 through 48 by the shift count generates Result bits 63 through 48.

In one embodiment, the shifting of step 1612 is performed simultaneously. However, in another embodiment, these shifts are performed serially. In another embodiment some of these shifts are performed simultaneously and some are performed serially. This discussion applies equally to the shifting of step 1614.

At step 1620, Result is stored in the DEST register.
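The shift of steps 1612 and 1614, with a scalar shift count, can be modeled as below. This is a sketch under stated assumptions: the name `packed_shift` and the `kind` strings are hypothetical, and treating an oversized count in a logical shift as producing all zeros is an assumption (the text above specifies the oversized-count behavior only for the arithmetic shift).

```python
def packed_shift(source1, count, elem_bits=8, kind="sra"):
    """Shift every element of a 64-bit packed value by one scalar count.

    kind: "sra" = arithmetic shift right, "srl" = logical shift right,
          "sll" = logical shift left (hypothetical labels for this sketch).
    """
    if kind not in ("sra", "srl", "sll"):
        raise ValueError("unknown shift kind: " + kind)
    mask = (1 << elem_bits) - 1
    result = 0
    for pos in range(0, 64, elem_bits):
        elem = (source1 >> pos) & mask
        if kind == "sra":
            # Sign-extend, then shift; a count above elem_bits - 1 leaves
            # every bit equal to the original sign bit.
            if elem >> (elem_bits - 1):
                elem -= 1 << elem_bits
            shifted = (elem >> min(count, elem_bits - 1)) & mask
        elif kind == "srl":
            # Assumption: an oversized count clears the element.
            shifted = elem >> min(count, elem_bits)
        else:
            shifted = (elem << min(count, elem_bits)) & mask
        result |= shifted << pos
    return result


# Byte element 3 holds 10000000 (binary); an arithmetic shift right by 4
# fills the vacated high order bits with the sign bit.
example = packed_shift(0x80 << 24, 4, elem_bits=8, kind="sra")
```

Because each element is masked separately, no bit shifted out of one element ever enters its neighbor.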

Table 19 illustrates a register representation of a packed byte arithmetic shift right operation. The first row has bits of the packed data representation of Source1. The second row has bits of the data representation of Source2. The third row has bits of the packed data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 3 is 10000000₂.

TABLE 19

Table 20 illustrates a register representation of a packed logical shift right operation on packed byte data.

TABLE 20

Table 21 illustrates a register representation of a packed logical shift left operation on packed byte data.

TABLE 21

Packed data shift circuit

In one embodiment, the shift operation can be performed on multiple data elements in the same number of clock cycles as a single shift operation on unpacked data. To achieve execution in the same number of clock cycles, parallelism is employed. That is, registers are simultaneously instructed to perform the shift operation on the data elements, as discussed in more detail below.

Fig. 17 illustrates a circuit for performing a packed shift on individual bytes of packed data according to one embodiment of the invention. Fig. 17 shows the use of a modified byte slice shift circuit, byte slice stage i 1799. Each byte slice (except the most significant data element byte slice) contains a shift unit and a bit control. The most significant data element byte slice requires only a shift unit.

Shift unit i 1711 and shift unit i+1 1771 each allow 8 bits from Source1 to be shifted by the shift count. In one embodiment, each shift unit operates like a known 8-bit shift circuit. Each shift unit has a Source1 input, a Source2 input, a control input, a next stage signal, a previous stage signal, and a result output. Thus, shift unit i 1711 has a Source1_i 1731 input, a Source2[63:0] 1733 input, a control i 1701 input, a next stage i 1713 signal, a previous stage i 1712 input, and a result stored in result register i 1751. Likewise, shift unit i+1 1771 has a Source1_i+1 1732 input, a Source2[63:0] 1733 input, a control i+1 1702 input, a next stage i+1 1773 signal, a previous stage i+1 1772 input, and a result stored in result register i+1 1752.

The Source1 input is typically an 8-bit portion of Source1. The 8 bits represent the smallest type of data element, one packed byte data element. The Source2 input represents the shift count. In one embodiment, each shift unit receives the same shift count from Source2[63:0] 1733. Operation control 1700 transmits control signals to enable each shift unit to perform the desired shift. The control signals are determined from the type of shift (arithmetic/logical) and the direction of the shift. The next stage signal is received from the bit control of that shift unit. Depending on the direction of the shift (left/right), each shift unit shifts its most significant bit out/in on the next stage signal. Similarly, each shift unit shifts its least significant bit out/in on the previous stage signal depending on the direction of the shift (right/left). The previous stage signal is received from the bit control unit of the previous stage. The result output represents the result of the shift operation on the portion of Source1 on which the shift unit operates.

Bit control i 1720 is enabled by operation control 1700 via packed data enable i 1706. Bit control i 1720 controls next stage i 1713 and previous stage i+1 1772. For example, assume that shift unit i 1711 is responsible for the 8 least significant bits of Source1, and shift unit i+1 1771 is responsible for the next 8 bits of Source1. If a shift on packed bytes is performed, bit control i 1720 will not allow the least significant bit from shift unit i+1 1771 to communicate with the most significant bit of shift unit i 1711. However, when a shift on packed words is performed, bit control i 1720 will allow the least significant bit from shift unit i+1 1771 to communicate with the most significant bit of shift unit i 1711.

For example, in Table 22, a packed byte arithmetic shift right is performed. Assume that shift unit i+1 1771 operates on data element 1 and shift unit i 1711 operates on data element 0. Shift unit i+1 1771 shifts out its least significant bit. However, operation control 1700 will cause bit control i 1720 to stop the bit received from previous stage i+1 1721 from propagating to next stage i 1713. Instead, shift unit i 1711 fills its high order bits with the sign bit, Source1[7].

TABLE 22

However, if a packed word arithmetic shift is performed, the least significant bit of shift unit i+1 1771 is passed to the most significant bit of shift unit i 1711. Table 23 shows this result. This transfer is also allowed for packed doubleword shifts.

TABLE 23

Each shift unit is optionally coupled to a result register. The result register temporarily stores the result of the shift operation until the entire Result[63:0] 1760 can be transferred to the DEST register.

For a complete 64-bit packed shift circuit, 8 shift units and 7 bit control units are used. This circuit can also be used to perform a shift on 64-bit unpacked data, so that the same circuit performs both unpacked and packed shift operations.
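The role of the bit control units can be illustrated with a behavioral sketch of eight byte slices joined by seven links. This is a simplification, not the circuit of Fig. 17: the name `slice_shift_right` is hypothetical, only right shifts are modeled, and the shift is carried out one bit position per pass, which a real shift unit would not need to do.

```python
def slice_shift_right(source1, count, elem_bits=8, arithmetic=False):
    """Right-shift packed data using 8 byte slices and 7 bit controls.

    Bit control i sits between slice i and slice i+1 and passes a bit
    only when both slices belong to the same data element.
    """
    slices = [(source1 >> (8 * i)) & 0xFF for i in range(8)]
    bytes_per_elem = elem_bits // 8
    link = [(i + 1) % bytes_per_elem != 0 for i in range(7)]
    for _ in range(min(count, elem_bits)):
        shifted = []
        for i in range(8):
            if i < 7 and link[i]:
                # Bit control enabled: take the least significant bit
                # shifted out of the next higher slice.
                incoming = slices[i + 1] & 1
            elif arithmetic:
                # Top slice of an element: replicate its sign bit.
                incoming = slices[i] >> 7
            else:
                incoming = 0
            shifted.append((slices[i] >> 1) | (incoming << 7))
        slices = shifted
    return sum(byte << (8 * i) for i, byte in enumerate(slices))


# The same slices act as a byte, word, or 64-bit shifter depending only
# on which links are enabled.
as_bytes = slice_shift_right(0x8101, 1, elem_bits=8)
as_words = slice_shift_right(0x8101, 1, elem_bits=16)
as_scalar = slice_shift_right(1 << 63, 63, elem_bits=64)
```

Setting `elem_bits=64` enables all seven links, which corresponds to using the same hardware for a shift on unpacked 64-bit data.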

The advantages of including the shift operation described above in the instruction set

The packed shift instruction described above causes each element of Source1 to be shifted by the specified shift count. By adding this instruction to the instruction set, a single instruction can be used to shift every element of a packed data. In contrast, prior art general purpose processors that do not support this operation must execute a number of instructions to unpack Source1, shift each unpacked data element individually, and then pack the results into a packed data format for further packed processing.

Transfer operation

The transfer operation transfers data to or from register 209. In one embodiment, SRC2 603 is the address containing the source data and DEST 605 is the address to which the data is to be transferred. In this embodiment, SRC1 602 is not used. In another embodiment, SRC1 602 is equal to DEST 605.

For the purpose of explaining the transfer operation, registers are distinguished from memory locations. Registers are in register file 150, while memory may be, for example, in cache 160, main memory 104, ROM 106, or data storage device 107.

The transfer operation may transfer data from memory to register 209, from register 209 to memory, and from one of registers 209 to a second one of registers 209. In one embodiment, the packed data is stored in different registers than the integer data. In this embodiment, a transfer operation can transfer data from integer register 201 to register 209. For example, in the processor 109, if packed data is stored in register 209 and integer data is stored in integer register 201, a transfer instruction can be used to transfer data from integer register 201 to register 209, or vice versa.

In one embodiment, when a memory address is specified for the transfer, the 8 bytes of data at that memory location (the memory location contains the lowest order byte) are loaded into a register in registers 209, or stored from that register to the specified memory location. When a register in registers 209 is specified, the contents of that register are transferred to, or loaded from, a second register in registers 209. If the integer registers 201 are 64 bits in length and an integer register is specified, then the 8 bytes of data in that integer register are loaded into a register in registers 209, or stored from that register into the specified integer register.

In one embodiment, integers are represented as 32 bits. When a transfer operation from register 209 to integer register 201 is performed, only the low order 32 bits of the packed data are transferred to the specified integer register. In one embodiment, the high order 32 bits are changed to 0. Similarly, when a transfer from integer register 201 to register 209 is performed, only the low order 32 bits of a register in registers 209 are loaded. In one embodiment, the processor 109 supports 32-bit transfer operations between a register in registers 209 and memory. In another embodiment, a 32-bit transfer is performed only on the high order 32 bits of the packed data.

Pack operation

In one embodiment of the invention, the SRC1 602 register contains data (Source1), the SRC2 603 register contains data (Source2), and the DEST 605 register will contain the result data (Result) of the operation. That is, parts of Source1 and parts of Source2 are packed together to generate Result.

In one embodiment, the pack operation converts packed words (or doublewords) into packed bytes (or words) by packing the low order bytes (or words) of the source packed words (or doublewords) into the bytes (or words) of Result. In one embodiment, the pack operation converts 4 packed words into packed doublewords. This operation can optionally be performed with signed data. Furthermore, this operation can optionally be performed with saturation. In an alternative embodiment, additional pack operations are included that operate on the high order portions of the data elements.

In step 1801, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: an operation code for the appropriate pack operation; SRC1 602, SRC2 603, and DEST 605 addresses in register 209; saturated/unsaturated, signed/unsigned, and length of the data elements in the packed data. As described above, SRC1 602 (or SRC2 603) may be used as DEST 605.

At step 1802, the decoder 202 accesses the register 209 in the register file 150 via the internal bus 170, given the addresses of SRC1 602 and SRC2 603. Register 209 provides the execution unit 130 with the packed data stored in the SRC1 602 register (Source1) and the packed data stored in the SRC2 603 register (Source2). That is, registers 209 pass the packed data to execution unit 130 via internal bus 170.

At step 1803, the decoder 202 enables the execution unit 130 to perform the appropriate pack operation. Decoder 202 also passes, via internal bus 170, the saturation setting and the length of the data elements in Source1 and Source2. Saturation may optionally be used to maximize the value of the data in the Result data element. If the value of a data element in Source1 or Source2 is greater than or less than the range of values that the data element of Result can represent, then the corresponding Result data element is set to its highest or lowest value. For example, if a signed value in a word data element of Source1 or Source2 is less than 0x80 (or 0x8000 for a doubleword), the Result byte (or word) data element is clamped to 0x80 (or 0x8000 for a doubleword). If a signed value in a word data element of Source1 or Source2 is greater than 0x7F (or 0x7FFF for a doubleword), the Result byte (or word) data element is clamped to 0x7F (or 0x7FFF).

At step 1810, the length of the data element determines which step is to be performed next. If the data element is 16 bits in length (packed word 402 data), execution unit 130 performs step 1812. However, if the data elements in the packed data are 32 bits in length (packed doubleword 403 data), execution unit 130 performs step 1814.

Assuming the source data element is 16 bits in length, step 1812 is performed. In step 1812, the following operations are performed. Source1 bits 7 through 0 are Result bits 7 through 0. Source1 bits 23 through 16 are Result bits 15 through 8. Source1 bits 39 through 32 are Result bits 23 through 16. Source1 bits 55 through 48 are Result bits 31 through 24. Source2 bits 7 through 0 are Result bits 39 through 32. Source2 bits 23 through 16 are Result bits 47 through 40. Source2 bits 39 through 32 are Result bits 55 through 48. Source2 bits 55 through 48 are Result bits 63 through 56. If saturation is set, the high order bits of each word are tested to determine if the Result data element should be clamped.

Assuming the source data element is 32 bits in length, step 1814 is performed. In step 1814, the following operations are performed. Source1 bits 15 through 0 are Result bits 15 through 0. Source1 bits 47 through 32 are Result bits 31 through 16. Source2 bits 15 through 0 are Result bits 47 through 32. Source2 bits 47 through 32 are Result bits 63 through 48. If saturation is set, the high order bits of each doubleword are tested to determine if the Result data element should be clamped.

In one embodiment, the packing of step 1812 is performed simultaneously. In yet another embodiment, this packing is performed serially. In another embodiment, some of the packing is performed simultaneously and some is performed serially. This discussion also applies to the packing of step 1814.

At step 1820, Result is stored in the DEST 605 register.
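Steps 1812 and 1814 can be sketched as follows. The function name `pack` and its parameters are hypothetical; the sketch assumes signed saturation as described for step 1803 and, without saturation, simply keeps the low order half of each source element.

```python
def to_signed(value, bits):
    """Interpret a `bits`-wide unsigned field as a two's-complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack(source1, source2, src_bits=16, saturate=False):
    """Pack two 64-bit sources into one 64-bit Result of half-width elements.

    Source1 supplies the low 32 bits of Result and Source2 the high 32 bits.
    """
    dst_bits = src_bits // 2
    src_mask = (1 << src_bits) - 1
    dst_mask = (1 << dst_bits) - 1
    highest = (1 << (dst_bits - 1)) - 1    # 0x7F or 0x7FFF
    lowest = -(1 << (dst_bits - 1))        # stored as 0x80 or 0x8000
    result = 0
    out_pos = 0
    for source in (source1, source2):
        for pos in range(0, 64, src_bits):
            elem = (source >> pos) & src_mask
            if saturate:
                # Clamp the signed source value to the signed result range.
                value = max(lowest, min(highest, to_signed(elem, src_bits)))
                packed = value & dst_mask
            else:
                packed = elem & dst_mask
            result |= packed << out_pos
            out_pos += dst_bits
    return result


# Words 256, -256, 127, and -128 saturate to bytes 0x7F, 0x80, 0x7F, 0x80.
clamped = pack(0xFF80007FFF000100, 0, src_bits=16, saturate=True)
truncated = pack(0xFF80007FFF000100, 0, src_bits=16, saturate=False)
```

Comparing `clamped` with `truncated` shows why saturation matters: truncation turns 256 and -256 into 0, while saturation keeps them at the nearest representable values.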

Table 24 illustrates a register representation of a pack word operation. The subscripted Hs and Ls denote the high and low order bits, respectively, of each 16-bit data element in Source1 and Source2. For example, A_L represents the low order 8 bits of data element A in Source1.

Watch 24

Table 25 illustrates a register representation of a packed double word operation, wherein subscripted Hs and Ls denote the high and low order bits of each 32-bit data element in Source1 and Source2, respectively.

TABLE 25

Assembly circuit

In one embodiment of the invention, parallelism is employed in order to achieve efficient execution of assembly operations. Fig. 19a and 19b illustrate circuitry for performing assembly operations on packet data in accordance with one embodiment of the present invention. The circuit is capable of selectively performing a pack operation with saturation.

The circuits of FIGS. 19a and 19b include operation control 1900, result register 1952, result register 1953, 8 16-bit to 8-bit test saturating circuits, and 4 32-bit to 16-bit test saturating circuits.

Operation control 1900 receives information from decoder 202 to initiate the assembly operation. Operation control 1900 uses the saturation value to enable the saturation test in each test saturation circuit. If the source packet data is word packet data 503, operation control 1900 sets output enable 1931. This enables the output of result register 1952. If the source packet data is doubleword packet data 504, operation control 1900 sets output enable 1932. This enables the output of result register 1953.

Each test saturation circuit is capable of selectively testing for saturation. If saturation testing is disabled, each test saturation circuit passes only the low order bits to the corresponding location in the result register. If test saturation is allowed, each test saturation circuit tests the high order bit to determine if the result should be clamped.

Test saturations 1910 through 1917 have 16-bit inputs and 8-bit outputs. The 8-bit output is the lower 8 bits of the input, or alternatively a clamped value (0x80, 0x7F, or 0xFF). Test saturation 1910 receives Source1 bits 15-0 and outputs bits 7-0 to result register 1952. Test saturation 1911 receives Source1 bits 31-16 and outputs bits 15-8 to result register 1952. Test saturation 1912 receives Source1 bits 47-32 and outputs bits 23-16 to result register 1952. Test saturation 1913 receives Source1 bits 63-48 and outputs bits 31-24 to result register 1952. Test saturation 1914 receives Source2 bits 15-0 and outputs bits 39-32 to result register 1952. Test saturation 1915 receives Source2 bits 31-16 and outputs bits 47-40 to result register 1952. Test saturation 1916 receives Source2 bits 47-32 and outputs bits 55-48 to result register 1952. Test saturation 1917 receives Source2 bits 63-48 and outputs bits 63-56 to result register 1952.

Test saturations 1920 through 1923 have 32-bit inputs and 16-bit outputs. The 16-bit output is the lower 16 bits of the input, or alternatively a clamped value (0x8000, 0x7FFF, or 0xFFFF). Test saturation 1920 receives Source1 bits 31-0 and outputs bits 15-0 to result register 1953. Test saturation 1921 receives Source1 bits 63-32 and outputs bits 31-16 to result register 1953. Test saturation 1922 receives Source2 bits 31-0 and outputs bits 47-32 to result register 1953. Test saturation 1923 receives Source2 bits 63-32 and outputs bits 63-48 to result register 1953.
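By way of illustration only, a single test saturation circuit might be modeled as in the Python sketch below. The function name `narrow` and its parameters are hypothetical. The sketch further assumes that signed saturation treats the input as a two's complement value and clamps to the signed output range (0x80 and 0x7F for 8-bit outputs), and that unsigned saturation treats the input as an unsigned value and clamps to the all-ones output (0xFF); the text above lists the clamp values but does not spell out these interpretations.

```python
def narrow(value: int, in_bits: int, signed: bool = True, saturate: bool = True) -> int:
    """Reduce an in_bits-wide data element to in_bits // 2 bits."""
    out_bits = in_bits // 2
    out_mask = (1 << out_bits) - 1
    value &= (1 << in_bits) - 1
    if not saturate:
        return value & out_mask              # saturation test disabled: low order bits only
    if signed:
        if value >> (in_bits - 1):           # assumption: input is two's complement
            value -= 1 << in_bits
        lowest = -(1 << (out_bits - 1))
        highest = (1 << (out_bits - 1)) - 1
        clamped = max(lowest, min(highest, value))
        return clamped & out_mask            # 0x80 or 0x7F when a 16-bit input is clamped
    return min(value, out_mask)              # assumption: unsigned input clamps to all ones

passed_through = narrow(0x1234, 16, saturate=False)
clamped_high = narrow(0x0123, 16, signed=True)
clamped_low = narrow(0xFF00, 16, signed=True)
```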

For example, in Table 26, an unsigned word assembly without saturation is performed. Operation control 1900 will enable result register 1952 to output Result [63:0] 1960.

Table 26

However, if an unsigned doubleword assembly without saturation is performed, operation control 1900 will enable result register 1953 to output Result [63:0] 1960. Table 27 shows this result.

Table 27

Advantages of including the above assembly operations in the instruction set

The assemble instruction assembles a predetermined number of bits from each data element in Source1 and Source2 to generate Result. In this manner, the processor 109 can assemble data in as few as half of the instructions required in prior art general purpose processors. For example, only one instruction (as opposed to two instructions) is required to generate a result containing 4 16-bit data elements from four 32-bit data elements, as shown below:

Table 28

Typical multimedia applications assemble large amounts of data. Thus, halving the number of instructions required to assemble these data improves the performance of these multimedia applications.

Decomposition operation

In one embodiment, the decomposition operation interleaves the low order packet bytes, words, or doublewords of the two source packet data to generate the resultant packet bytes, words, or doublewords. This operation is referred to herein as a decomposition low operation. In another embodiment, the decomposition operation may also interleave the high order elements (referred to as a decomposition high operation).

First, steps 2001 and 2002 are performed. In step 2003, decoder 202 enables execution unit 130 to perform the decomposition operation. Decoder 202 passes the length of the data elements in Source1 and Source2 over internal bus 170.

At step 2010, the length of the data element determines which step is to be performed next. If the data element is 8 bits in length (packet byte 401 data), execution unit 130 performs step 2012. If the data elements in the packet data are 16 bits in length (packet word 402 data), execution unit 130 performs step 2014. If the data elements in the packet data are 32 bits in length (packet doubleword 403 data), execution unit 130 performs step 2016.

Assuming the source data element is 8 bits in length, step 2012 is performed. In step 2012, the following operations are performed. Source1 bits 7 through 0 are Result bits 7 through 0. Source2 bits 7 through 0 are Result bits 15 through 8. Source1 bits 15 through 8 are Result bits 23 through 16. Source2 bits 15 through 8 are Result bits 31 through 24. Source1 bits 23 through 16 are Result bits 39 through 32. Source2 bits 23 through 16 are Result bits 47 through 40. Source1 bits 31 through 24 are Result bits 55 through 48. Source2 bits 31 through 24 are Result bits 63 through 56.

Assuming the source data element is 16 bits in length, step 2014 is performed. In step 2014, the following operations are performed. Source1 bits 15 through 0 are Result bits 15 through 0. Source2 bits 15 through 0 are Result bits 31 through 16. Source1 bits 31 through 16 are Result bits 47 through 32. Source2 bits 31 through 16 are Result bits 63 through 48.

Assuming the source data element is 32 bits in length, step 2016 is performed. In step 2016, the following operations are performed. Source1 bits 31 through 0 are Result bits 31 through 0. Source2 bits 31 through 0 are Result bits 63 through 32.

In one embodiment, the decomposition of step 2012 is performed concurrently. However, in another embodiment, this decomposition is performed serially. In another embodiment, some decompositions are performed simultaneously and some are performed serially. This discussion also applies to the decomposition of step 2014 and step 2016.

At step 2020, Result is stored in DEST605 register.
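By way of illustration only, steps 2012, 2014 and 2016 can be modeled with one loop that alternates elements from the low order halves of the two sources. The function name `unpack_low` is an assumption made for this Python sketch, and Python integers stand in for the 64-bit registers.

```python
def unpack_low(source1: int, source2: int, elem_bits: int) -> int:
    """Interleave the low order elements of source1 and source2."""
    mask = (1 << elem_bits) - 1
    result = 0
    for i in range(32 // elem_bits):          # only bits 31..0 of each source are read
        a = (source1 >> (i * elem_bits)) & mask
        b = (source2 >> (i * elem_bits)) & mask
        result |= a << (2 * i * elem_bits)            # Source1 element lands in the even slot
        result |= b << ((2 * i + 1) * elem_bits)      # Source2 element lands in the odd slot
    return result

s1 = 0xFFFFFFFF04030201
s2 = 0xEEEEEEEE14131211
as_bytes = unpack_low(s1, s2, 8)      # step 2012
as_words = unpack_low(s1, s2, 16)     # step 2014
as_dwords = unpack_low(s1, s2, 32)    # step 2016
```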

Table 29 illustrates a decomposition doubleword operation (data elements A_0-1 and B_0-1 each contain 32 bits).

Table 29

Table 30 illustrates a decomposition word operation (data elements A_0-3 and B_0-3 each contain 16 bits).

Table 30

Table 31 illustrates a decomposition byte operation (data elements A_0-7 and B_0-7 each contain 8 bits).

Table 31

Decomposition circuit

Fig. 21 shows a circuit for performing a decomposition operation on packet data according to an embodiment of the present invention. The circuit of Fig. 21 includes operation control 2100, result register 2152, result register 2153, and result register 2154.

Operation control 2100 receives information from decoder 202 to initiate the decomposition operation. If the source packet data is byte packet data 502, operation control 2100 sets output enable 2132. This enables the output of result register 2152. If the source packet data is word packet data 503, operation control 2100 sets output enable 2133. This enables the output of result register 2153. If the source packet data is doubleword packet data 504, operation control 2100 sets output enable 2134. This enables the output of result register 2154.

Result register 2152 has the following inputs. Source1 bits 7 through 0 are bits 7 through 0 of result register 2152. Source2 bits 7 through 0 are bits 15 through 8 of result register 2152. Source1 bits 15-8 are bits 23-16 of result register 2152. Source2 bits 15 through 8 are bits 31 through 24 of result register 2152. Source1 bits 23 through 16 are bits 39 through 32 of result register 2152. Source2 bits 23 through 16 are bits 47 through 40 of result register 2152. Source1 bits 31-24 are bits 55-48 of result register 2152. Source2 bits 31-24 are bits 63-56 of result register 2152.

Result register 2153 has the following inputs. Source1 bits 15 through 0 are bits 15 through 0 of result register 2153. Source2 bits 15 through 0 are bits 31 through 16 of result register 2153. Source1 bits 31-16 are bits 47-32 of result register 2153. Source2 bits 31-16 are bits 63-48 of result register 2153.

Result register 2154 has the following inputs. Source1 bits 31 through 0 are bits 31 through 0 of result register 2154. Source2 bits 31-0 are bits 63-32 of result register 2154.

For example, in Table 32, a decomposition word operation is performed. Operation control 2100 will enable result register 2153 to output Result [63:0] 2160.

Table 32

However, if a decomposition doubleword operation is performed, operation control 2100 will enable result register 2154 to output Result [63:0] 2160. Table 33 shows this result.

Table 33

Advantages of including the above-described decomposition instruction in the instruction set

By adding the above-described decomposition instruction to the instruction set, packet data may be either interleaved or decomposed. The decomposition instruction decomposes packet data when the data elements in Source2 are all 0s. An example of decomposing bytes is shown in Table 34a.

Table 34a

The same decomposition instruction may be used to interleave data, as shown in Table 34b. Interleaving is useful in a variety of multimedia algorithms. For example, interleaving may be used to transpose matrices and to interpolate pixels.

Table 34b

Thus, by adding this split instruction to the instruction set supported by the processor 109, the processor 109 is more versatile and can execute algorithms requiring this functionality at a higher performance level.
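By way of illustration only, the Python sketch below shows the two uses described above for byte data: with Source2 set to all 0s each low order byte of Source1 is widened to a 16-bit element with a zero high byte, and with a nonzero Source2 the bytes of the two sources are interleaved. The function name `unpack_low_bytes` and the sample values are assumptions made for this example.

```python
def unpack_low_bytes(source1: int, source2: int) -> int:
    """Interleave the four low order bytes of source1 and source2."""
    result = 0
    for i in range(4):
        result |= ((source1 >> (8 * i)) & 0xFF) << (16 * i)        # even byte from Source1
        result |= ((source2 >> (8 * i)) & 0xFF) << (16 * i + 8)    # odd byte from Source2
    return result

pixels = 0x00000000C0804010
widened = unpack_low_bytes(pixels, 0)                        # Source2 all 0s: bytes become words
interleaved = unpack_low_bytes(0x04030201, 0x14131211)       # general interleave
```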

Number count

One embodiment of the present invention allows a number count (population count) operation to be performed on packet data. That is, the present invention generates one result data element for each data element of the first packet data. Each result data element represents the number of bits set in the corresponding data element of the first packet data. In one embodiment, a bit counts as set when it equals 1.

Table 35a illustrates a register representation of a number count operation on packet data. The first row of bits is the packet data representation of Source1 packet data. The second row of bits is the packet data representation of Result packet data. The number below each data element is the data element number. For example, Source1 data element 0 is 1000111110001000₂. Thus, if the data element length is 16 bits (word data) and a number count operation is performed, the execution unit 130 generates the Result packet data as shown.

Table 35a

In another embodiment, the number counting is performed on 8-bit data elements. Table 35b illustrates a register representation of the number count on packet data having 8-bit packet data elements.

TABLE 35b

In another embodiment, the number counting is performed on 32-bit data elements. Table 35c illustrates a register representation of the number count on packet data having two 32-bit packet data elements.

Table 35c

The number counting can also be performed on 64-bit integer data. That is, the total number of bits with a set value of 1 in the 64-bit data is obtained. Table 35d illustrates a register representation of the number count on the 64-bit integer data.

Table 35d

Method of performing a number count

Fig. 22 is a flow diagram illustrating a method of performing a number count operation on packet data in accordance with one embodiment of the present invention. In step 2201, the decoder 202 decodes a control signal 207 upon receiving it. In one embodiment, the control signal 207 is supplied via the bus 101. In another embodiment, control signal 207 is supplied by cache memory 160. Thus, the decoder 202 decodes: the opcode for the number count operation, and the SRC1 602 and DEST 605 addresses in register 209. Note that SRC2 603 is not used in the current embodiment of the invention. Nor are saturation/non-saturation, signed/unsigned, or the length of the data elements in the packet data used in this embodiment. In the current embodiment of the invention, only number counting on 16-bit data elements is supported. However, those skilled in the art will appreciate that number counting can also be performed on packet data having 8 packet byte data elements or two packet doubleword data elements.

At step 2202, the decoder 202 accesses the register 209 in the register file 150 via the internal bus 170, given the SRC1 602 address. Register 209 provides the packet data Source1 stored in the register at this address to execution unit 130. That is, register 209 passes the packet data to execution unit 130 via internal bus 170.

At step 2203, the decoder 202 enables the execution unit 130 to perform the number count operation. In an alternative embodiment, decoder 202 also communicates the length of the packet data elements over internal bus 170.

At step 2205, assuming the data element length is 16 bits, execution unit 130 totals the bits set in Source1 bits 15 through 0, producing Result packet data bits 15 through 0. In parallel with this counting, the execution unit 130 counts the total of the Source1 bits 31 through 16, producing Result packet data bits 31 through 16. In parallel with the generation of these totals, execution unit 130 sums bits 47 through 32 of Source1, producing bits 47 through 32 of Result packet data. In parallel with the generation of these sums, execution unit 130 sums bits 63 through 48 of Source1, producing bits 63 through 48 of Result packet data.

At step 2206, decoder 202 enables the register in register 209 with the DEST605 address of the destination register. Accordingly, Result packet data is stored in the register addressed by DEST 605.
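By way of illustration only, the result of step 2205 for 16-bit data elements can be modeled as in the Python sketch below. The function name `packed_popcount16` is an assumption made for this example, and the loop is sequential, whereas the text describes the four counts as being generated in parallel. Data element 0 of the sample input is the value 1000111110001000₂ used above.

```python
def packed_popcount16(source1: int) -> int:
    """Count the set bits in each of the four 16-bit elements of source1."""
    result = 0
    for i in range(4):
        word = (source1 >> (16 * i)) & 0xFFFF
        count = bin(word).count("1")      # number of bits equal to 1 in this element
        result |= count << (16 * i)       # the count occupies the same element position
    return result

counts = packed_popcount16(0xFFFF000000018F88)
```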

Method for performing number counting on data element

Fig. 23 is a flow diagram illustrating a method for performing a number count operation on one data element of packet data and generating a single result data element of the Result packet data in accordance with one embodiment of the present invention. At step 2310a, a column sum CSum1a and a column carry CCarry1a are generated from Source1 bits 15, 14, 13, and 12. At step 2310b, the column sum CSum1b and column carry CCarry1b are generated from Source1 bits 11, 10, 9 and 8. At step 2310c, the column sum CSum1c and column carry CCarry1c are generated from Source1 bits 7, 6, 5 and 4. At step 2310d, the column sum CSum1d and column carry CCarry1d are generated from Source1 bits 3, 2, 1 and 0. In one embodiment of the invention, steps 2310a-d are performed in parallel. At step 2320a, column sum CSum2a and column carry CCarry2a are generated from CSum1a, CCarry1a, CSum1b and CCarry1b. At step 2320b, column sum CSum2b and column carry CCarry2b are generated from CSum1c, CCarry1c, CSum1d and CCarry1d. In one embodiment of the invention, steps 2320a-b are performed in parallel. At step 2330, the column sum CSum3 and the column carry CCarry3 are generated from CSum2a, CCarry2a, CSum2b, and CCarry2b. At step 2340, Result is generated from CSum3 and CCarry3. In one embodiment, Result is represented in 16 bits. In this embodiment, bits 15 through 5 are set to 0, since only bits 4 through 0 are needed to represent the maximum number of bits set in Source1. The maximum number of set bits in Source1 is 16. This occurs when Source1 equals 1111111111111111₂. Result is then 16, represented as 0000000000010000₂.

Thus, to calculate 4 result data elements for a number count operation on 64-bit packed data, the steps of FIG. 23 are performed for each data element in the packed data. In one embodiment, 4 16-bit result data elements are computed in parallel.
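By way of illustration only, the three reduction stages and the final addition of FIG. 23 can be modeled behaviorally as in the Python sketch below. The function names are hypothetical. The sketch assumes that each 4->2 carry save adder is built from two cascaded 3->2 carry save adders; the description does not specify the internal structure of the adders, so this is one possible model rather than the circuit itself.

```python
def csa_3to2(x: int, y: int, z: int):
    """3->2 carry save adder: returns (sum, carry) with sum + carry == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def csa_4to2(a: int, b: int, c: int, d: int):
    """4->2 carry save adder modeled as two cascaded 3->2 adders (an assumption)."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)

def popcount16_tree(word: int) -> int:
    bits = [(word >> i) & 1 for i in range(16)]
    # Stage 1 (steps 2310a-d): four groups of four one-bit operands.
    stage1 = [csa_4to2(*bits[i:i + 4]) for i in range(0, 16, 4)]
    # Stage 2 (steps 2320a-b): each adder combines two sum/carry pairs.
    stage2 = [csa_4to2(*stage1[i], *stage1[i + 1]) for i in (0, 2)]
    # Stage 3 (step 2330): one adder combines the remaining two pairs.
    csum3, ccarry3 = csa_4to2(*stage2[0], *stage2[1])
    # Step 2340: a conventional addition yields the final count.
    return csum3 + ccarry3

example = popcount16_tree(0b1000111110001000)
```

Each stage preserves the total of its inputs, so the value produced by the final addition equals the number of set bits regardless of how the bits are grouped.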

Circuit for counting execution number

Fig. 24 illustrates a circuit for performing a number count operation on packed data having 4 word data elements in accordance with one embodiment of the present invention. Fig. 25 illustrates a detailed circuit for performing a number count operation on a word data element of a packet data according to one embodiment of the present invention.

In the circuit of FIG. 24, Source1 bus 2401 carries information signals to popcnt circuits 2408a-d via Source1_IN 2406a-d. Thus, popcnt circuit 2408a counts the total number of bits set in bit 15 through bit 0 of Source1, and generates bit 15 through bit 0 of Result. Popcnt circuit 2408b counts the total number of bits set in bit 31 through bit 16 of Source1, and generates bit 31 through bit 16 of Result. Popcnt circuit 2408c counts the total number of bits set in bit 47 through bit 32 of Source1, and generates bit 47 through bit 32 of Result. Popcnt circuit 2408d counts the total number of bits set in bit 63 through bit 48 of Source1, and generates bit 63 through bit 48 of Result. Enables 2404a-d receive, through control 2403, control signals from operation control 2410 that enable popcnt circuits 2408a-d to perform the number count operations and to place Result on Result bus 2409. Given the above description and the descriptions and illustrations in FIGS. 1-6b and 22-25, those skilled in the art will be able to build this circuit.

The popcnt circuits 2408a-d pass the Result information of the packet number count operation on to the Result bus 2409 through Result outputs 2407 a-d. This result information is then stored in the integer register specified by the DEST605 register address.

Circuit for performing a number count on one data element

Fig. 25 shows a detailed circuit for performing a number count operation on one word data element of packet data. Specifically, fig. 25 shows a portion of a popcnt circuit 2408 a. To achieve maximum performance for applications that employ a number count operation, this operation should be done within one clock cycle. Thus, assuming that accessing the registers and storing the results requires a certain percentage of a clock cycle, the circuit of FIG. 24 completes its operation in approximately 80% of the time of one clock cycle. This circuit has the advantage of allowing the processor 109 to perform a number count operation on four 16-bit data elements in one clock cycle.

Popcnt circuit 2408a employs 4->2 carry save adders (CSA refers to a 4->2 carry save adder unless otherwise specified). The 4->2 carry save adders that may be employed in popcnt circuits 2408a-d are well known in the art. A 4->2 carry save adder is an adder that adds 4 operands to yield two sums. Since the number count operation in popcnt circuit 2408a covers 16 bits, the first stage includes four 4->2 carry save adders. These four 4->2 carry save adders transform 16 one-bit operands into 8 2-bit sums. The second stage transforms the 8 2-bit sums into 4 3-bit sums, and the third stage transforms the 4 3-bit sums into two 4-bit sums. A 4-bit full adder then adds the two 4-bit sums to produce the final result.

While 4->2 carry save adders are employed here, 3->2 carry save adders may be employed in alternative embodiments. In addition, a plurality of full adders can also be used; however, this configuration does not provide results as quickly as the embodiment shown in Fig. 25.

Source1_IN 15-0 2406a carries bit 15 through bit 0 of Source1. The first four bits are coupled to the inputs of a 4->2 carry save adder (CSA 2510a). The next four bits are coupled to the inputs of CSA 2510b. The next four bits are coupled to the inputs of CSA 2510c. The last four bits are coupled to the inputs of CSA 2510d. Each of CSA 2510a-d generates two 2-bit outputs. The two 2-bit outputs of CSA 2510a are coupled to two inputs of CSA 2520a. The two 2-bit outputs of CSA 2510b are coupled to the other two inputs of CSA 2520a. The two 2-bit outputs of CSA 2510c are coupled to two inputs of CSA 2520b. The two 2-bit outputs of CSA 2510d are coupled to the remaining two inputs of CSA 2520b. Each of CSA 2520a-b generates two 3-bit outputs. The two 3-bit outputs of CSA 2520a are coupled to two inputs of CSA 2530. The two 3-bit outputs of CSA 2520b are coupled to the remaining two inputs of CSA 2530. CSA 2530 generates two 4-bit outputs.

These two 4-bit outputs are coupled to two inputs of a full adder (FA 2550). The FA 2550 adds the two 4-bit inputs and passes bit 3 to bit 0 of Result output 2407a as the sum of the two 4-bit inputs. FA 2550 generates bit 4 of Result output 2407a via a carry out (CO 2552). In an alternative embodiment, a 5-bit full adder is employed to generate bit 4 through bit 0 of Result output 2407 a. In either case, bit 15 through bit 5 of Result output 2407a are fixed at 0. Likewise, any carry input to the full adder is fixed at 0.

Although not shown in FIG. 25, those skilled in the art will appreciate that Result output 2407a may be multiplexed or buffered onto Result bus 2409. The multiplexer is controlled by enable 2404 a. This will allow other execution unit circuitry to write data onto Result bus 2409.

Advantages of adding the above-mentioned number counting operations to the instruction set

The number count instruction counts the number of bits set in each data element of packet data, such as Source1. Thus, by adding this instruction to the instruction set, the number count operation can be performed on packet data in a single instruction. In contrast, prior art general purpose processors must execute a number of instructions to decompose Source1, perform the function on each decomposed data element individually, and then assemble the results for further packet processing.

Thus, by including this number count instruction in the instruction set supported by processor 109, the performance of algorithms requiring this functionality is improved.

Logical operations

In one embodiment of the invention, the SRC1 register contains packet data (Source1), the SRC2 register contains packet data (Source2), and the DEST register will contain the Result (Result) of performing the selected logical operation on Source1 and Source 2. For example, if a logical AND operation is selected, then Source1 is logically ANDed with Source 2.

In one embodiment of the invention, the following logical operations are supported: a logical AND, a logical AND NOT (ANDN), a logical OR, and a logical exclusive OR (XOR). Logical AND, OR, and XOR operations are well known in the art. A logical AND NOT (ANDN) operation ANDs the logical NOT of Source2 with Source1. Although the present invention is described with respect to these logical operations, other embodiments may implement other logical operations.

Fig. 26 is a flow diagram illustrating a method of performing several logical operations on packet data in accordance with one embodiment of the present invention.

In step 2601, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: an opcode for the appropriate logical operation (i.e., AND, ANDN, OR, or XOR); and the SRC1 602, SRC2 603, and DEST 605 addresses in register 209.

At step 2602, the decoder 202 accesses the register 209 in the register file 150 via the internal bus 170, giving the addresses of SRC1602 and SRC 2603. Register 209 provides the execution unit 130 with the packet data stored in the SRC1602 registers (Source1) and the packet data stored in the SRC2603 registers (Source 2). That is, register 209 passes packet data to execution unit 130 via internal bus 170.

At step 2603, the decoder 202 enables the execution unit 130 to perform the selected one of the packet logical operations.

At step 2610, the selected packet logical operation determines which step is performed next. If the logical AND operation is selected, the execution unit 130 performs step 2612; if the logical ANDN operation is selected, the execution unit 130 performs step 2613; if the logical OR operation is selected, the execution unit 130 performs step 2614; if the logical XOR operation is selected, the execution unit 130 performs step 2615.

Assuming that a logical AND operation is selected, step 2612 is performed. In step 2612, Source1 bits 63 through 0 and Source2 bits 63 through 0 are ANDed to generate Result bits 63 through 0.

Assuming that a logical ANDN operation is selected, step 2613 is performed. In step 2613, Source1 bits 63 through 0 are ANDed with the logical NOT of Source2 bits 63 through 0 to generate Result bits 63 through 0.

Assuming a logical OR operation is selected, step 2614 is performed. In step 2614, Source1 bits 63 through 0 and Source2 bits 63 through 0 are ORed to generate Result bits 63 through 0.

Assuming a logical XOR operation is selected, step 2615 is performed. In step 2615, Source1 bits 63 through 0 and Source2 bits 63 through 0 are XORed to generate Result bits 63 through 0.

At step 2620, Result is stored in the DEST register.
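By way of illustration only, the selection made at step 2610 and the operations of steps 2612 through 2615 can be modeled as in the Python sketch below. The function name `packed_logical` and the string operation names are assumptions made for this example. For ANDN the sketch assumes that Source2 is the inverted operand, following the description of ANDN as ANDing the logical NOT of Source2 with Source1.

```python
MASK64 = (1 << 64) - 1

def packed_logical(op: str, source1: int, source2: int) -> int:
    """Apply one bitwise operation across all 64 bits of the two sources."""
    if op == "AND":
        result = source1 & source2
    elif op == "ANDN":
        result = source1 & ~source2      # assumption: Source2 is the operand that is inverted
    elif op == "OR":
        result = source1 | source2
    elif op == "XOR":
        result = source1 ^ source2
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result & MASK64               # keep the result within 64 bits

a = 0xFF00FF00FF00FF00
b = 0x0F0F0F0F0F0F0F0F
results = {op: packed_logical(op, a, b) for op in ("AND", "ANDN", "OR", "XOR")}
```

Because the operations are bitwise, the same 64-bit operation serves every data element length without any per-element handling.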

Table 36 illustrates a register representation of a logical ANDN operation on packet data. The first row of bits is the packet data representation of Source1. The second row of bits is the packet data representation of Source2. The third row of bits is the packet data representation of Result. The number below each data element is the data element number. For example, Source1 data element 2 is 1111111100000000₂.

Table 36

Although the present invention is described with respect to performing the same logical operation on corresponding data elements in Source1 and Source2, alternative embodiments may support instructions that allow the selection of logical operations to be performed on corresponding data elements on an element-by-element basis.

Packet data logic circuit

In one embodiment, the above-described logical operations can occur on multiple data elements in the same number of clock cycles as a single logical operation on non-grouped data. To achieve execution in the same number of clock cycles, parallelism is employed.

Fig. 27 illustrates circuitry for performing logical operations on packet data in accordance with one embodiment of the present invention. Operation control 2700 controls the circuit that performs the logical operations. Operation control 2700 processes the control signals and outputs select signals on control lines 2780. These select signals communicate the selected one of the AND, ANDN, OR, and XOR operations to logical operation circuit 2701.

Logical operation circuit 2701 receives Source1[63:0] and Source2[63:0] and performs the logical operation specified by the select signals to generate Result. Logical operation circuit 2701 passes Result [63:0] to result register 2731.

The advantages of adding the above logic operations to the instruction set

The above-described logical instructions perform a logical AND, a logical ANDN, a logical OR and a logical XOR. These instructions are useful in any application that requires logical manipulation of data. By adding these instructions to the instruction set supported by the processor 109, these logical operations may be performed on packet data in one instruction.

Packet comparison

Packet compare operation

In one embodiment of the invention, the SRC1 602 register contains the data to be compared (Source1), the SRC2 603 register contains the data against which to compare (Source2), and the DEST 605 register contains the Result of the comparison (Result). That is, each data element of Source1 is independently compared, under the specified relationship, with the corresponding data element of Source2.

In one embodiment of the invention, the following comparison relationships are supported: equal; signed greater than; signed greater than or equal; unsigned greater than; or unsigned greater than or equal. The relationship is tested on each pair of corresponding data elements. For example, Source1[7:0] is tested for being greater than Source2[7:0], and the outcome is placed in Result[7:0]. If the comparison satisfies the relationship, then in one embodiment the corresponding data element in Result is set to all 1s. If the comparison does not satisfy the relationship, the corresponding data element in Result is set to all 0s.

At step 2801, decoder 202 decodes control signal 207 received by processor 109. Thus, the decoder 202 decodes: operation codes of the proper comparison operation; SRC1602, SRC2603, and DEST605 addresses in register 209; saturation/non-saturation (not necessary for comparison operations), signed/unsigned and length of data elements in the packet data. As described above, SRC1602 (or SRC 2603) may be used as DEST 605.

At step 2802, the decoder 202 accesses the register 209 in the register file 150 via the internal bus 170 given the addresses of the SRC1602 and SRC 2603. Register 209 provides the execution unit 130 with the packet data stored in the SRC1602 registers (Source1) and the packet data stored in the SRC2603 registers (Source 2). That is, register 209 passes packet data to execution unit 130 via internal bus 170.

At step 2803, decoder 202 enables execution unit 130 to perform the appropriate packet comparison operation. The decoder 202 also passes the relationship of the length of the data elements and the comparison operation over the internal bus 170.

At step 2810, the length of the data element determines which step is to be performed next. If the data element is 8 bits in length (packet byte 401 data), execution unit 130 performs step 2812. However, if the data elements in the packet data are 16 bits in length (packet word 402 data), execution unit 130 performs step 2814. In one embodiment, only packet comparisons of 8-bit and 16-bit data element lengths are supported. However, in another embodiment, packet comparisons of 32-bit data element lengths (packet doubleword 403) are also supported.

Assuming that the data element is 8 bits in length, step 2812 is performed. In step 2812, the following operation is performed. Comparing Source1 bits 7 through 0 to Source2 bits 7 through 0 generates Result bits 7 through 0. Comparing Source1 bits 15-8 to Source2 bits 15-8 generates Result bits 15-8. Comparing Source1 bits 23 through 16 to Source2 bits 23 through 16 generates Result bits 23 through 16. Comparing Source1 bits 31 through 24 to Source2 bits 31 through 24 generates Result bits 31 through 24. Comparing Source1 bits 39 through 32 to Source2 bits 39 through 32 generates Result bits 39 through 32. Comparing Source1 bits 47 through 40 to Source2 bits 47 through 40 generates Result bits 47 through 40. Comparing Source1 bits 55 through 48 to Source2 bits 55 through 48 generates Result bits 55 through 48. Comparing Source1 bits 63 through 56 to Source2 bits 63 through 56 generates Result bits 63 through 56.

Assuming that the data element is 16 bits in length, step 2814 is performed. In step 2814, the following operation is performed. Comparing Source1 bits 15 through 0 to Source2 bits 15 through 0 generates Result bits 15 through 0. Comparing Source1 bits 31 through 16 to Source2 bits 31 through 16 generates Result bits 31 through 16. Comparing Source1 bits 47 through 32 to Source2 bits 47 through 32 generates Result bits 47 through 32. Comparing Source1 bits 63 through 48 to Source2 bits 63 through 48 generates Result bits 63 through 48.

In one embodiment, the comparisons of step 2812 are performed simultaneously. However, in another embodiment, these comparisons are performed serially. In yet another embodiment, some comparisons are performed simultaneously and some are performed serially. This discussion also applies to the comparisons of step 2814.

At step 2820, Result is stored in the DEST 605 register.
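Steps 2810 through 2820 can be sketched in software as follows. This is a minimal behavioral model, not the circuit of the invention: the function names are hypothetical, a 64-bit register is modeled as a Python integer, and it is assumed that a true comparison sets every bit of the result element to 1 and a false comparison clears it.

```python
def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def packed_compare_gt(source1, source2, elem_bits, signed=True):
    """Compare each element of source1 with the matching element of source2.

    Assumption: a true comparison yields an all-ones element, a false
    comparison yields an all-zeros element.
    """
    elem_mask = (1 << elem_bits) - 1
    result = 0
    for shift in range(0, 64, elem_bits):      # one iteration per data element
        a = (source1 >> shift) & elem_mask
        b = (source2 >> shift) & elem_mask
        if signed:
            a, b = to_signed(a, elem_bits), to_signed(b, elem_bits)
        if a > b:
            result |= elem_mask << shift       # set every bit of this element
    return result
```

Calling the function with `elem_bits=8` corresponds to step 2812 and with `elem_bits=16` to step 2814; the returned integer plays the role of Result.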

Table 37 illustrates the register representation of the packed compare unsigned greater than operation. The first row of bits is the packed data representation of Source1. The second row of bits is the packed data representation of Source2. The third row of bits is the packed data representation of Result. The number below each data element bit is the data element number. For example, Source1 data element 3 is 10000000₂.

Table 37

Table 38 illustrates the register representation of the packed compare signed greater than or equal to operation on packed byte data.

Table 38

Packed data comparison circuit

In one embodiment, the compare operation can be performed on multiple data elements in the same number of clock cycles as a single compare operation on unpacked data. To achieve execution in the same number of clock cycles, parallelism is used. That is, the registers are simultaneously instructed to perform the compare operation on the data elements. This is discussed in more detail below.

Fig. 29 shows a circuit for performing a packed compare operation on individual bytes of packed data according to one embodiment of the invention. Fig. 29 shows the use of a modified byte slice compare circuit, byte slice stage i 2999. Each byte slice, except the most significant data element byte slice, contains a compare unit and a bit control. The most significant data element byte slice requires only a compare unit.

Comparison unit i 2911 and comparison unit i+1 2971 each allow 8 bits from Source1 to be compared with the corresponding 8 bits from Source2. In one embodiment, each compare unit operates like a known 8-bit compare unit. This known 8-bit compare circuit includes a byte slice circuit that allows Source2 to be subtracted from Source1. The result of the subtraction is processed to determine the result of the compare operation. In one embodiment, the subtraction result includes overflow information. This overflow information is tested to determine whether the result of the compare operation is true.

Each comparison unit has a Source1 input, a Source2 input, a control input, a next stage signal, a previous stage signal, and a result output. Thus, comparison unit i 2911 has a Source1_i 2931 input, a Source2_i 2933 input, a control i 2901 input, a next stage i 2913 signal, a previous stage i 2912 input, and a result stored in result register i 2951. Likewise, comparison unit i+1 2971 has a Source1_i+1 2932 input, a Source2_i+1 2934 input, a control i+1 2902 input, a next stage i+1 2973 signal, a previous stage i+1 2972 input, and a result stored in result register i+1 2952.

The Source1_n input is typically an 8-bit portion of Source1. The 8 bits represent the smallest type of data element, one packed byte 401 data element. The Source2 input is the corresponding 8-bit portion of Source2. Operation control 2900 transmits control signals to enable each comparison unit to perform the required comparison. The control signals are determined from the relationship for the comparison (e.g., signed greater than) and the length of the data elements (e.g., byte or word). The next stage signal is received from the bit control for that comparison unit. The bit control units effectively combine comparison units when data elements larger than a byte are used. For example, when packed word data is compared, the bit control unit between the first and second comparison units causes the two comparison units to operate as one 16-bit comparison unit. Similarly, the bit control unit between the third and fourth comparison units causes those two comparison units to operate as one comparison unit. This continues for all four packed word data elements.

Depending on the required relationship and the values of Source1 and Source2, the comparison units perform the comparison by allowing the results of the higher order comparison units to propagate down to the lower order comparison units, or vice versa. In this way, each comparison unit provides a comparison result using the information passed by bit control i 2920. If packed doubleword data is used, four comparison units work together to form one 32-bit comparison unit for each data element. The result output of each compare unit represents the result of the compare operation on the portion of Source1 and Source2 on which that compare unit operates.

Bit control i 2920 is enabled from operation control 2900 via packed data enable i 2906. Bit control i 2920 controls the next stage i 2913 and the previous stage i+1 2972. For example, assume that comparison unit i 2911 is responsible for the 8 least significant bits of Source1 and Source2, and comparison unit i+1 2971 is responsible for the next 8 bits of Source1 and Source2. If the comparison is performed on packed byte data, bit control i 2920 will not allow result information from comparison unit i+1 2971 to pass to comparison unit i 2911, or vice versa. However, if the comparison is performed on packed words, bit control i 2920 will allow the result (in one embodiment, overflow) information from comparison unit i 2911 to pass to comparison unit i+1 2971, and the result (in one embodiment, overflow) information from comparison unit i+1 2971 to pass to comparison unit i 2911.

For example, in Table 39, a packed byte signed greater than compare is performed. Assume that comparison unit i+1 2971 operates on data element 1 and comparison unit i 2911 operates on data element 0. Comparison unit i+1 2971 compares the most significant 8 bits of a word and passes the result information via the previous stage i+1 2972. Comparison unit i 2911 compares the least significant 8 bits of the word and passes the result information via the next stage i 2913. However, operation control 2900 will cause bit control i 2920 to stop the propagation of the result information received from the previous stage i+1 2972 and the next stage i 2913 between the comparison units.

Table 39

However, if a packed word signed greater than compare is performed, the result of comparison unit i+1 2971 will be passed to comparison unit i 2911, and vice versa. Table 40 shows this result. This type of transfer is also allowed for packed doubleword compares.

Table 40

Each comparison unit is optionally coupled to a result register. The result register temporarily stores the result of the compare operation until the complete Result[63:0] 2960 can be transferred to the DEST 605 register.

A complete 64-bit packed compare circuit uses 8 comparison units and 7 bit control units. This circuit can also be used to perform a comparison on 64-bit unpacked data, so that the same circuit performs both unpacked and packed compare operations.
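The way the bit control units join or isolate byte slices can be illustrated with a simplified behavioral model. The sketch below is an assumption-laden illustration only: it handles the unsigned greater than relationship, it does not model the subtract-and-overflow logic described above, and the function name is hypothetical. Each inner-loop iteration plays the role of one 8-bit comparison unit, and `elem_bytes` plays the role of the bit control setting (1 for bytes, 2 for words, 4 for doublewords, 8 for unpacked 64-bit data).

```python
def slice_compare_gt(source1, source2, elem_bytes):
    """Unsigned packed greater-than built from eight 8-bit comparison slices."""
    result = 0
    for base in range(0, 8, elem_bytes):           # first slice of each element
        greater = False
        # Walk from the least to the most significant slice of the element.
        # A more significant slice overrides the lower slices unless its two
        # bytes are equal, in which case the lower result propagates upward.
        for i in range(base, base + elem_bytes):
            a = (source1 >> (8 * i)) & 0xFF
            b = (source2 >> (8 * i)) & 0xFF
            if a != b:
                greater = a > b
        if greater:
            result |= ((1 << (8 * elem_bytes)) - 1) << (8 * base)
    return result
```

With `elem_bytes=1` no information crosses slice boundaries, which corresponds to the bit controls blocking propagation; with larger values the slices of one element cooperate as a single wider comparator.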

Advantages of including the packed compare operation described above in the instruction set

The packed compare instruction described above stores the result of comparing Source1 and Source2 as a packed mask. As described above, conditional branches on data are unpredictable and waste processor performance because they defeat branch prediction algorithms. However, by generating a packed mask, this compare instruction reduces the number of data-based conditional branches required. For example, the function (if Y > A, then X = X + B; else X = X) may be performed on packed data as shown in Table 41 below (the values in Table 41 are shown in hexadecimal notation).

Table 41

As can be seen from the above example, conditional branching is no longer required. Since no branch instruction is needed, a processor that speculatively predicts branches suffers no performance degradation when performing this and similar operations with this compare instruction. Thus, by providing this compare instruction in the instruction set supported by processor 109, processor 109 can execute algorithms requiring this function at a higher performance level.
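The branch-free form of (if Y > A, then X = X + B; else X = X) can be sketched as a compare, a logical AND, and a packed add. The model below is a hedged illustration: it assumes four signed 16-bit elements per 64-bit register, wraparound addition, and an all-ones mask for a true comparison; all helper names are hypothetical.

```python
def lanes(value, bits=16):
    """Split a 64-bit integer into unsigned elements, least significant first."""
    mask = (1 << bits) - 1
    return [(value >> s) & mask for s in range(0, 64, bits)]


def pack(elems, bits=16):
    """Join elements (least significant first) into one 64-bit integer."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def signed(v, bits=16):
    return v - (1 << bits) if v >> (bits - 1) else v


def packed_compare_gt(s1, s2):
    """Signed greater-than; assumed to give 0xFFFF for true and 0 for false."""
    return pack([0xFFFF if signed(a) > signed(b) else 0
                 for a, b in zip(lanes(s1), lanes(s2))])


def packed_add(s1, s2):
    """Element-wise addition with wraparound."""
    return pack([a + b for a, b in zip(lanes(s1), lanes(s2))])


def conditional_add(x, y, a, b):
    mask = packed_compare_gt(y, a)     # all ones where Y > A
    return packed_add(x, b & mask)     # adds B where selected, 0 elsewhere
```

No element-by-element branch is taken: the mask zeroes B in the elements where the condition fails, so those elements of X pass through unchanged.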

Example of multimedia algorithms

To illustrate the versatility of the disclosed instruction set, several multimedia algorithm examples are described below. In some cases, similar packed data instructions could also be used to perform certain steps of these algorithms. In the following examples, several steps that require general-purpose processor instructions to manage data movement, looping, and conditional branching have been omitted.

1)Complex multiplication

The disclosed multiply-add instruction can be used to multiply two complex numbers in a single instruction, as shown in Table 42a. The multiplication of two complex numbers (e.g., r1 i1 and r2 i2) is performed according to the following equations:

Real part = r1 · r2 - i1 · i2

Imaginary part = r1 · i2 + r2 · i1

If the instruction is implemented to complete in one clock cycle, the invention can multiply two complex numbers in one clock cycle.

Table 42a
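As an illustration of how one multiply-add can yield both parts of the product, the sketch below models the instruction on four 16-bit elements per source. The operand layout (r1, i1, r1, i1 against r2, -i2, i2, r2) is an assumption chosen to reproduce the two equations above, since the contents of Table 42a are not reproduced here; the function names are hypothetical and overflow is not modeled.

```python
def multiply_add(source1, source2):
    """Model of a packed multiply-add: products of adjacent element pairs
    are summed into two wider results (first pair, second pair)."""
    p = [a * b for a, b in zip(source1, source2)]
    return [p[0] + p[1], p[2] + p[3]]


def complex_multiply(r1, i1, r2, i2):
    source1 = [r1, i1, r1, i1]       # first complex number, repeated
    source2 = [r2, -i2, i2, r2]      # second complex number, arranged and negated
    real, imag = multiply_add(source1, source2)
    return real, imag
```

For example, `complex_multiply(1, 2, 3, 4)` multiplies 1 + 2i by 3 + 4i with a single modeled multiply-add.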

As another example, table 42b shows an instruction to multiply three complex numbers together.

Table 42b

2)Multiply-accumulate operation

The disclosed instructions can also be used to multiply and accumulate values. For example, two sets of 4 data elements (A_1-4 and B_1-4) may be multiplied and accumulated as shown in Table 43 below. In one embodiment, each of the instructions shown in Table 43 is implemented to complete in one clock cycle.

Table 43
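A multiply-accumulate over two sets of four elements can be sketched as a multiply-add followed by a shift and a packed add. The exact instruction sequence of Table 43 is not reproduced here, so the sequence below is an assumption: registers are modeled as 64-bit Python integers, sources hold four signed 16-bit elements, the multiply-add result holds two signed 32-bit elements, and all function names are hypothetical.

```python
def to_signed(v, bits):
    return v - (1 << bits) if v >> (bits - 1) else v


def pack(elems, bits):
    """Join elements (least significant first) into one 64-bit integer."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def unpack(value, bits):
    """Split a 64-bit integer into signed elements, least significant first."""
    mask = (1 << bits) - 1
    return [to_signed((value >> s) & mask, bits) for s in range(0, 64, bits)]


def multiply_add(s1, s2):
    a, b = unpack(s1, 16), unpack(s2, 16)
    return pack([a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3]], 32)


def packed_add32(s1, s2):
    return pack([x + y for x, y in zip(unpack(s1, 32), unpack(s2, 32))], 32)


def multiply_accumulate4(a, b):
    partial = multiply_add(pack(a, 16), pack(b, 16))   # two partial sums
    high = partial >> 32                               # unpacked right shift by 32 bits
    total = packed_add32(partial, high)                # low element now holds the sum
    return unpack(total, 32)[0]
```

The low 32-bit element of the final register holds A1·B1 + A2·B2 + A3·B3 + A4·B4; the high element is a leftover partial sum and is ignored.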

If the number of data elements in each group exceeds 8 and is a multiple of 4, fewer instructions are required for the multiplication and accumulation of these groups if performed as shown in table 44 below.

Table 44

As another example, Table 45 shows separate multiplications and accumulations for sets A and B and sets C and D, where each of these sets contains two data elements.

Table 45

As another example, Table 46 shows separate multiplications and accumulations for sets A and B and sets C and D, where each of these sets contains 4 data elements.

Table 46

3)Dot product algorithm

Dot products (also known as inner products) are used in signal processing and matrix operations. For example, dot products are used in computing the product of matrices, in digital filtering operations (such as FIR and IIR filtering), and in computing correlation sequences. Since many speech compression algorithms (such as GSM, G.728, CELP, and VSELP) and high fidelity compression algorithms (such as MPEG and sub-band coding) make extensive use of digital filtering and correlation computations, improving the performance of the dot product amounts to improving the performance of these algorithms.

The dot product of two sequences A and B of length N is defined as:

$$\text{Result} = \sum_{i=0}^{N-1} A_i \cdot B_i$$

Performing a dot product calculation makes extensive use of the multiply-accumulate operation, in which corresponding elements of each sequence are multiplied and the results are accumulated to form the dot product result.

The present invention allows dot product calculations to be performed using packed data by including the transfer, packed add, multiply-add, and packed shift operations. For example, if a packed data type containing four 16-bit elements is used, the dot product calculation can be performed on two sequences that each contain four values using the following operations:

1) taking four 16-bit values from the A sequence using the transfer instruction to generate Source1;

2) taking four 16-bit values from the B sequence using the transfer instruction to generate Source2; and

3) using the multiply-add, packed add, and shift instructions to multiply and accumulate as described above.

For vectors with a few elements, the method shown in Table 46 is used and the final results are added together at the end. Other supporting instructions include the packed OR and XOR instructions for initializing the accumulator register, and the packed shift instruction for shifting out unneeded values at the final stage of the computation. Loop control operations are performed using instructions already in the instruction set of processor 109.
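The three operations listed above extend naturally to longer sequences by keeping two running partial sums and combining them once at the end. The following sketch assumes the sequence length is a multiple of 4 and uses Python lists in place of packed registers; the function names are hypothetical and overflow of the accumulators is not modeled.

```python
def multiply_add(source1, source2):
    """Model of a packed multiply-add on four elements: products of adjacent
    pairs are summed into two wider results."""
    p = [a * b for a, b in zip(source1, source2)]
    return [p[0] + p[1], p[2] + p[3]]


def dot_product(a, b):
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("sequences must have equal length, a multiple of 4")
    acc = [0, 0]                          # cleared accumulator (e.g. register XOR itself)
    for k in range(0, len(a), 4):
        source1 = a[k:k + 4]              # step 1: four values from A
        source2 = b[k:k + 4]              # step 2: four values from B
        partial = multiply_add(source1, source2)
        acc = [acc[0] + partial[0], acc[1] + partial[1]]   # packed add
    return acc[0] + acc[1]                # final shift and add of the two halves
```

Only the last line reduces across elements, so the loop body consists entirely of element-parallel operations.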

4)Two-dimensional loop filter

Two-dimensional loop filters are used in some multimedia algorithms. For example, the filter coefficients shown in Table 47 below may be used in a video conferencing algorithm to perform low pass filtering on pixel data.

Table 47

To calculate the new value of the pixel at position (x, y), the following equation is used:

Result pixel = (x-1, y-1) + 2(x, y-1) + (x+1, y-1) + 2(x-1, y) + 4(x, y) + 2(x+1, y) + (x-1, y+1) + 2(x, y+1) + (x+1, y+1)

The present invention allows a two-dimensional loop filter to be implemented using packed data by including the assemble, disassemble, transfer, packed shift, and packed add instructions. According to one implementation of the loop filter described above, the loop filter is applied as two simple one-dimensional filters; that is, the two-dimensional filter described above can be applied as two 1-2-1 filters. The first filter is applied in the horizontal direction and the second filter in the vertical direction.

Table 48 shows a representation of an 8 x 8 block of pixel data.

Table 48

The following steps are performed to implement the horizontal pass of the filter over this 8 x 8 block of pixel data:

1) accessing 8-bit pixel values as packed data using a transfer instruction;

2) decomposing the 8-bit pixels into 16-bit packed data elements (Source1, containing four pixels) to maintain precision during accumulation;

3) copying Source1 twice to generate Source2 and Source3;

4) performing an unpacked right shift by 16 bits on Source1;

5) performing an unpacked left shift by 16 bits on Source3;

6) generating (Source1 + 2 × Source2 + Source3) by performing the following packed additions:

a) Source1 = Source1 + Source2;

b) Source1 = Source1 + Source2;

c) Source1 = Source1 + Source3;

7) storing the resulting packed word data as part of an 8 x 8 intermediate result array; and

8) repeating these steps until the entire 8 x 8 intermediate result array is generated, as shown in Table 49 below (e.g., IA₀ represents the intermediate result for A₀ from Table 49).

Table 49

The following steps are performed to implement the vertical pass of the filter over the 8 x 8 intermediate result array:

1) accessing 4 x 4 blocks of data from the intermediate result array as packed data using a transfer instruction to generate Source1, Source2, and Source3 (see, e.g., Table 50);

Table 50

2) generating (Source1 + 2 × Source2 + Source3) by performing the following packed additions:

a) Source1 = Source1 + Source2;

b) Source1 = Source1 + Source2;

c) Source1 = Source1 + Source3;

3) performing a packed right shift by 4 bits on the resulting Source1 to generate the sum of the weighted values, effectively dividing by 16;

4) assembling the resulting Source1 with saturation, converting the 16-bit values back into 8-bit pixel values;

5) storing the resulting packed byte data as part of an 8 x 8 result array (for the example shown in Table 50, the four bytes represent the new pixel values for B0, B1, B2, and B3); and

6) repeating these steps until the entire 8 x 8 result array is generated.

It is worth noting that the top and bottom rows of the 8 x 8 result array are determined using a different algorithm, which is not described herein so as not to obscure the present invention.
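The equivalence between the two 1-2-1 passes and the nine-term equation can be checked with a short model. The sketch below is illustrative only: it works on Python lists rather than packed registers, computes interior pixels only (border handling is omitted, as in the text), and uses hypothetical function names.

```python
def filter_121_rows(block):
    """Horizontal pass: left + 2 * centre + right for each interior column."""
    return [[row[x - 1] + 2 * row[x] + row[x + 1]
             for x in range(1, len(row) - 1)]
            for row in block]


def filter_121_cols(block):
    """Vertical pass: above + 2 * centre + below for each interior row."""
    return [[block[y - 1][x] + 2 * block[y][x] + block[y + 1][x]
             for x in range(len(block[0]))]
            for y in range(1, len(block) - 1)]


def saturate_u8(value):
    """Clamp to the 8-bit pixel range, as assembling with saturation would."""
    return max(0, min(255, value))


def loop_filter(block):
    intermediate = filter_121_rows(block)       # wider values keep precision
    summed = filter_121_cols(intermediate)
    return [[saturate_u8(v >> 4) for v in row] for row in summed]   # divide by 16


def direct_filter(block, x, y):
    """Nine-term weighted sum for one interior pixel, for comparison."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    total = sum(kernel[j][i] * block[y - 1 + j][x - 1 + i]
                for j in range(3) for i in range(3))
    return total >> 4
```

For an 8 x 8 input, `loop_filter` returns the 6 x 6 interior, and each entry matches `direct_filter` at the corresponding position because the 3 x 3 kernel is the outer product of (1, 2, 1) with itself.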

Thus, by providing the assemble, disassemble, transfer, packed shift, and packed add instructions on processor 109, the present invention achieves significantly higher performance than prior art general purpose processors, which must perform the operations required by these filters one data element at a time.

5)Motion estimation

Motion estimation is used in several multimedia applications, such as video conferencing and MPEG (high quality video playback). For video conferencing, motion estimation is used to reduce the amount of data that must be transmitted between terminals. Motion estimation works by dividing the video frames into fixed-size video blocks. For each block in frame 1, it is determined whether there is a block containing a similar image in frame 2. If such a block is contained in frame 2, that block can be described with a motion vector reference into frame 1. Thus, rather than transmitting all of the data representing that block, only a motion vector needs to be transmitted to the receiving terminal. For example, if a block in frame 1 is similar to a block in frame 2 and is at the same screen coordinates, only a motion vector of 0 needs to be sent for that block. However, if a block in frame 1 is similar to a block in frame 2 but is at different screen coordinates, only a motion vector indicating the new location of that block needs to be sent. In one implementation, to determine whether block A in frame 1 is similar to block B in frame 2, the sum of the absolute differences between the pixel values is determined. The lower the sum, the more similar block A is to block B (i.e., if the sum is 0, block A is identical to block B).

The present invention allows motion estimation to be performed using packed data by including the transfer, decompose, packed add, packed subtract with saturation, and logical operations. For example, if two 16 x 16 video blocks are represented as two arrays of 8-bit pixel values stored as packed data, the absolute differences between the pixel values in the two blocks can be calculated by:

1) using the transfer instruction to fetch eight 8-bit values from block A to generate Source1;

2) using the transfer instruction to fetch eight 8-bit values from block B to generate Source2;

3) performing a packed subtract with saturation, subtracting Source1 from Source2, to generate Source3; because the subtraction saturates, Source3 contains only the positive results of this subtraction (i.e., negative results become 0);

4) performing a packed subtract with saturation, subtracting Source2 from Source1, to generate Source4; because the subtraction saturates, Source4 contains only the positive results of this subtraction (i.e., negative results become 0);

5) performing a packed OR operation on Source3 and Source4 to generate Source5; as a result of this OR operation, Source5 contains the absolute values of the differences between Source1 and Source2;

6) repeating these steps until the 16 x 16 blocks have been processed.

The resulting 8-bit absolute values are decomposed into 16-bit data elements to allow 16-bit precision, and are then summed using packed adds.
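The absolute difference trick in steps 3 through 5 relies on the fact that, for each pair of pixels, at most one of the two saturated subtractions is non-zero, so an OR merges them without interference. A minimal sketch, using Python lists of 8-bit values in place of packed registers and hypothetical function names:

```python
def subtract_saturate_u8(a, b):
    """Element-wise unsigned a - b; negative results are clamped to 0."""
    return [max(x - y, 0) for x, y in zip(a, b)]


def packed_or(a, b):
    return [x | y for x, y in zip(a, b)]


def sum_abs_diff(block_a, block_b):
    """Sum of absolute differences, processed eight pixels at a time."""
    total = 0
    for k in range(0, len(block_a), 8):
        source1 = block_a[k:k + 8]
        source2 = block_b[k:k + 8]
        source3 = subtract_saturate_u8(source2, source1)   # positive part of B - A
        source4 = subtract_saturate_u8(source1, source2)   # positive part of A - B
        source5 = packed_or(source3, source4)              # |A - B| per element
        total += sum(source5)                              # accumulated at wider precision
    return total
```

A result of 0 means the two blocks are identical; smaller sums indicate more similar blocks.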

Thus, by providing the transfer, decompose, packed add, packed subtract with saturation, and logical operations on processor 109, the present invention provides a significant performance improvement over prior art general purpose processors, which must perform the additions and absolute differences of the motion estimation calculation one data element at a time.

6)Discrete cosine transform

The Discrete Cosine Transform (DCT) is a well-known function used in many signal processing algorithms. Video and image compression algorithms in particular make extensive use of this transformation.

In image and video compression algorithms, a block of pixels is transformed from a spatial representation to a frequency representation using DCT. In frequency representation, the picture information is divided into frequency components, some of which are more important than others. The compression algorithm selectively quantizes or discards frequency components that do not adversely affect the reconstructed picture content. Compression is achieved in this manner.

There are many implementations of the DCT, the most popular being fast transform methods modeled on the Fast Fourier Transform (FFT) computation flow. In a fast transform, an order N transform is decomposed into a combination of order N/2 transforms, and the results are recombined. This decomposition can be carried out until the smallest order two transform is reached. This elementary order two transform kernel is commonly referred to as the butterfly operation. The butterfly operation is expressed as follows:

X = a*x + b*y

Y = c*x - d*y

where a, b, c, and d are called the coefficients, x and y are the input data, and X and Y are the transform outputs.

By including the transfer, multiply-add, and packed shift instructions, the present invention allows the DCT calculation to be performed using packed data in the following manner:

1) using the transfer and decompose instructions to fetch two 16-bit values representing x and y to generate Source1 (see Table 51 below);

2) generating Source2 as shown in Table 51 below; note that Source2 can be reused over several butterfly operations; and

3) executing the multiply-add instruction using Source1 and Source2 to generate Result (see Table 51 below).

Table 51
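One butterfly can be modeled as a single multiply-add. The element layout used below (x, y, x, y against a, b, c, -d) is an assumption chosen to reproduce X = a*x + b*y and Y = c*x - d*y, since the contents of Table 51 are not reproduced here; the function names are hypothetical and overflow is not modeled.

```python
def multiply_add(source1, source2):
    """Model of a packed multiply-add on four 16-bit elements: products of
    adjacent pairs are summed into two 32-bit results."""
    p = [a * b for a, b in zip(source1, source2)]
    return [p[0] + p[1], p[2] + p[3]]


def butterfly(x, y, a, b, c, d):
    source1 = [x, y, x, y]       # input data, duplicated
    source2 = [a, b, c, -d]      # coefficients; reusable across butterflies
    X, Y = multiply_add(source1, source2)
    return X, Y
```

Because `source2` depends only on the coefficients, it can be built once and reused for every butterfly that shares them, which matches the note in step 2.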

In some cases, the coefficient of the butterfly is 1. For these cases, the butterfly degenerates to only add and subtract, which may be performed using a packet add and packet subtract instruction.

An IEEE standard specifies the precision with which the inverse DCT must be performed for video conferencing. (See IEEE Circuits and Systems Society, "IEEE Standard Specifications for the Implementations of 8 × 8 Inverse Discrete Cosine Transform," IEEE Std. 1180-1990, IEEE Inc., 345 East 47th St., NY, NY 10017, USA, March 18, 1991.) The disclosed multiply-add instruction meets this required precision because it uses 16-bit inputs to generate 32-bit outputs.

Thus, by providing the transfer, multiply-add, and packed shift operations on processor 109, the present invention provides a significant performance improvement over prior art general purpose processors, which must perform the additions and multiplications of the DCT calculation one data element at a time.

Alternative embodiments

Although the invention has been described as having separate circuits for each of the different operations, alternative embodiments can be implemented such that different operations share certain circuits. For example, the following circuitry is used in one embodiment: 1) a single arithmetic logic unit (ALU) to perform the packed add, packed subtract, packed compare, and packed logical operations; 2) a circuit unit to perform the assemble, disassemble, and packed shift operations; 3) a circuit unit to perform the packed multiply and multiply-add operations; and 4) a circuit unit to perform the population count operations.

The terms "correspond" and "corresponding" are used herein to refer to a predetermined relationship between data elements stored in two or more packed data. In one embodiment, this relationship is based on the bit positions of the data elements in the packed data. For example, data element 0 of the first packed data (e.g., stored in bit positions 0-7 in packed byte format) corresponds to data element 0 of the second packed data (e.g., stored in bit positions 0-7 in packed byte format). However, this relationship may be different in different embodiments. For example, corresponding data elements in the first and second packed data may have different lengths. As another example, rather than the least significant data element of the first packed data corresponding to the least significant data element of the second packed data (and so on), the data elements in the first and second packed data may correspond to each other in some other order. As another example, rather than having a one-to-one correspondence of data elements in the first and second packed data, the data elements may correspond in different ratios (e.g., one or more data elements of the first packed data may correspond to two or more different data elements in the second packed data).

While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The method and apparatus of the present invention may be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.

Claims

1. An apparatus for performing operations of a multimedia application, the apparatus comprising:

a storage area that stores data including packet data each having a plurality of data elements;

an execution unit coupled to the memory area, the execution unit to perform operations specified by a packet data instruction set having a format for identifying first source data and second source data, wherein the execution unit comprises:

parsing means for performing one or more parsing type operations, each of which results in generation and storage of result packed data in the memory area, the result packed data including less than all of the data elements of the first source data interleaved with corresponding data elements of the second source data;

assembling means for performing one or more assembling type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data including a plurality of bit portions derived from each data element in the first source data and the second source data, the plurality of bit portions from the first source data being adjacent to each other in the result packed data, and the plurality of bit portions from the second source data being adjacent to each other in the result packed data;

population count means for performing one or more population count type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data comprising at least first and second result data elements, the first result data element representing the total number of bits set in a first data element of the first source data, the second result data element representing the total number of bits set in a second data element of the first source data;

packed add means for performing one or more packed add type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data including, as independent result elements, each data element of the first source data added to the corresponding data element of the second source data;

packet subtraction means for performing one or more packet subtraction type operations each of which results in generation and storage of result grouping data in the memory area, the result grouping data comprising each data element of the second source data subtracted from a corresponding data element of the first source data as an independent result element;

packed compare means for performing one or more packed compare type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data comprising masks as independent result elements, each mask indicating the result of a corresponding comparison of one data element in the first source data with the corresponding data element in the second source data, each of those masks whose corresponding comparison is true comprising a plurality of bits each having a first predetermined value, and each of those masks whose corresponding comparison is false comprising a plurality of bits each having a second predetermined value;

multiply-add means for performing one or more multiply-add type operations, each of which results in the generation and storage of resultant packet data being performed in said memory area, the result grouped data includes a first result element and a second result element, the first and second result elements are stored without summing the first and second result elements, wherein the first result element represents a first sum of results of multiplying two pairs of corresponding data elements of the first source data and the second source data, the second result element represents a second sum of results of multiplying corresponding data elements of two different pairs of the first source data and the second source data, the first and second result elements have a higher precision than each of the first and second source data used to generate the first and second result elements; and

packet shifting means for performing one or more packet shifting type operations each of which results in generation and storage of result packed data at the store, the result packed data comprising the first source data as independent result elements, each data element in the first source data having been shifted by an amount specified by the second source data and having its bits padded, if necessary, by the amount.

2. The apparatus of claim 1, wherein for at least one of the one or more split type operations, the resultant packet data comprises upper or lower half bits of the first source data and the second source data.

3. The apparatus of claim 1, wherein, for at least one of the one or more assembling type operations, each data element of the first and second source data is an N-bit data element, and each of the plurality of bit portions is an N/2-bit result element.

4. The apparatus of any of claims 1-3, wherein, for at least one of the one or more multiply-add type operations, each data element in the first and second source data comprises an N-bit data element, and each of the first result element and the second result element is a 2N-bit result element.

5. The apparatus of any one of claims 1-3, wherein the storage area is a set of one or more registers, further comprising:

a first circuit unit comprising: said means for adding packets to perform said one or more packet adding type operations, said means for subtracting packets to perform said one or more packet subtracting type operations, and said means for comparing packets to perform said one or more packet comparing type operations;

a second circuit unit, comprising: said assembling means for performing said one or more assembly type operations, said disassembling means for performing said one or more disassembly type operations, and said packet shifting means for performing said one or more packet shifting type operations; and

a third circuit unit including: said packet multiply means for performing said one or more packet multiply type operations, and said packet multiply add means for performing said one or more multiply add type operations.

6. The apparatus of claim 1, wherein said format comprises designating the first storage location as a source and destination operand and designating the second storage location as a source operand.

7. The apparatus of claim 6, wherein each of said source operand and said source and destination operands is specified as a register number, said source operand being specified by bits 0-2 of an opcode byte, said source and destination operands being specified by bits 3-5 of said opcode byte.

8. The apparatus of any of claims 1-3, 6 or 7, wherein the packet shifting means for performing one or more packet shifting type operations comprises:

packet right shift arithmetic means for performing one or more packet right shift arithmetic type operations each of which causes the result packet data to contain the first source data as an independent result element, each data element of the first source data being right shifted by the amount specified by the second source data and padded with high order bits, if necessary, with a sign value;

packet left shifting means for performing one or more packet left shift type operations, each of which causes the resultant packetized data to contain the first source data as an independent resultant element, each data element of the first source data left shifted by the amount specified by the second source data and padded with low order bits, if necessary, padded with 0's; and

packet right shift logic means for performing one or more packet right shift logic type operations, each of which causes the result packed data to contain the first source data as an independent result element, each data element of the first source data being right shifted by the amount specified by the second source data and padded with high order bits, padded with 0 if required.

9. A system for performing operations of a multimedia application, the system comprising:

a storage device for storing a plurality of sequences of instructions;

a display device;

a sound reproducing device; and

a processor coupled to the storage device, the display device, and the sound reproducing device, the processor comprising:

an execution unit to perform an operation specified by a packet data instruction set having a format for identifying first source data and second source data, wherein the execution unit comprises:

first circuitry to perform one or more decomposition type operations, each of which results in generation and storage of result packed data in the memory region, the result packed data including less than all of the data elements of the first source data interleaved with corresponding data elements of the second source data;

second circuitry to perform one or more packed type operations, each of which results in generation and storage of result packed data in the memory region, the result packed data comprising a plurality of bit portions derived from each data element in the first source data and the second source data, the plurality of bit portions from the first source data being adjacent to one another in the result packed data, and the plurality of bit portions of the second source data being adjacent to one another in the result packed data;

third circuitry to perform one or more population count type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data including at least first and second result data elements, the first result data element representing a total number of bits set in a first data element of the first source data, the second result data element representing a total number of bits set in a second data element of the first source data;

fourth circuitry to perform one or more packet-add-type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data including each data element of the first source data added with a corresponding data element of the second source data as an independent result element;

fifth circuitry to perform one or more packet-subtract-type operations, each of which results in generation and storage of result packet data in the memory region, the result packet data including each data element of the second source data subtracted from a corresponding data element of the first source data as an independent result element;

sixth circuitry to perform one or more packet comparison type operations, each of which results in generation and storage of result packed data at the memory region, the result packed data including, as independent result elements, masks each indicating the result of comparing a data element in the first source data with the corresponding data element in the second source data, each mask corresponding to a true comparison comprising a plurality of bits each having a first predetermined value, and each mask corresponding to a false comparison comprising a plurality of bits each having a second predetermined value;

seventh circuitry to perform one or more multiply-add type operations, each of which results in generation and storage of result packed data in the memory region, the result packed data including a first result element and a second result element that are stored without being summed with each other, wherein the first result element represents a first sum of the results of multiplying two pairs of corresponding data elements of the first source data and the second source data, the second result element represents a second sum of the results of multiplying two different pairs of corresponding data elements of the first source data and the second source data, and the first and second result elements have a higher precision than each of the data elements of the first and second source data used to generate them; and

eighth circuitry to perform one or more packet-shift-type operations, each of which results in generation and storage of result packed data in the storage area, the result packed data comprising, as independent result elements, each data element of the first source data shifted by an amount specified by the second source data, with the bits vacated in each data element by that amount filled, if necessary, with a padding value,

wherein the first to eighth circuitry performing the different types of operations are independent circuits, or a given circuit performs more than one type of operation.
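
By way of non-limiting illustration, the element-wise add, subtract, and comparison operations of claim 9 may be modelled as below. The sketch assumes four 16-bit elements per 64-bit value, wrap-around arithmetic, an equality comparison, and all-ones and all-zeros as the first and second predetermined mask values; none of these choices is fixed by the claim, and the function names are hypothetical.

```python
ELEM_BITS = 16    # assumed element width
ELEM_COUNT = 4    # assumed number of elements in a 64-bit value
MASK = (1 << ELEM_BITS) - 1


def split(packed):
    """Split a packed value into its elements, lowest element first."""
    return [(packed >> (i * ELEM_BITS)) & MASK for i in range(ELEM_COUNT)]


def join(elems):
    """Join elements (lowest first); each is truncated to the element width."""
    packed = 0
    for i, e in enumerate(elems):
        packed |= (e & MASK) << (i * ELEM_BITS)
    return packed


def packed_add(a, b):
    # Each pair of elements is added on its own; carries never cross elements.
    return join([x + y for x, y in zip(split(a), split(b))])


def packed_sub(a, b):
    # Each second-source element is subtracted from the first-source element.
    return join([x - y for x, y in zip(split(a), split(b))])


def packed_cmp_eq(a, b):
    # A true comparison yields an all-ones mask, a false one an all-zeros mask.
    return join([MASK if x == y else 0 for x, y in zip(split(a), split(b))])
```

A mask produced this way can be combined with other packed data by logical operations to select elements without branching.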

10. The system of claim 9, wherein

for at least one of the one or more decomposition type operations, the result packed data includes the upper or lower half bits of the first source data and the second source data;

for at least one of the one or more packed type operations, each data element of the first and second source data is an N-bit data element, and each of the plurality of bit portions is an N/2-bit result element; and

for at least one of the one or more multiply-add type operations, each data element in the first and second source data comprises an N-bit data element, and each of the first result element and the second result element is a 2N-bit result element.
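
By way of non-limiting illustration, the element widths recited in claim 10 may be modelled as below. The sketch works on Python lists of element values (lowest element first) with N assumed to be 16. Keeping the low order half of each element in `pack` is an assumption, since the claim does not say which bits are kept; all function names are hypothetical.

```python
N = 16  # assumed source element width in bits


def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def unpack(a, b, high=False):
    """Interleave the lower (or upper) half of a's elements with b's."""
    half = len(a) // 2
    start = half if high else 0
    out = []
    for x, y in zip(a[start:start + half], b[start:start + half]):
        out += [x, y]
    return out


def pack(a, b):
    """Keep an N/2-bit portion (here the low half) of every element.

    Portions from a stay adjacent to one another, followed by those from b.
    """
    mask = (1 << (N // 2)) - 1
    return [x & mask for x in a] + [y & mask for y in b]


def multiply_add(a, b):
    """Multiply N-bit element pairs and sum adjacent products into 2N bits."""
    prods = [to_signed(x, N) * to_signed(y, N) for x, y in zip(a, b)]
    mask = (1 << (2 * N)) - 1
    return [(prods[i] + prods[i + 1]) & mask for i in range(0, len(prods), 2)]
```

The 2N-bit result width lets each sum of products be held without discarding the high order bits of the individual products.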

11. The system of any of claims 9-10, wherein the storage area is a set of one or more registers, further comprising the plurality of circuits coupled to the storage area, the plurality of circuits including circuits shared by different operations, comprising:

a first circuit unit comprising: said fourth circuitry to perform said one or more packet add type operations, said fifth circuitry to perform said one or more packet subtract type operations, and said sixth circuitry to perform said one or more packet comparison type operations;

a second circuit unit comprising: said second circuitry to perform said one or more packed type operations, said first circuitry to perform said one or more decomposition type operations, and said eighth circuitry to perform said one or more packet shift type operations; and

a third circuit unit comprising: ninth circuitry to perform one or more packet multiply type operations, and said seventh circuitry to perform said one or more multiply-add type operations.

12. The system of claim 9, wherein the format comprises designating the first storage location as a source and destination operand and designating the second storage location as a source operand.

13. The system of claim 12, wherein each of said source operand and said source and destination operands is specified as a register number, said source operand being specified by bits 0-2 of an opcode byte, said source and destination operands being specified by bits 3-5 of said opcode byte.

14. The system of claim 9, wherein said eighth circuitry to perform said one or more packet shifting type operations comprises:

packet right shift arithmetic means for performing one or more packet right shift arithmetic type operations, each of which causes the result packet data to contain, as independent result elements, each data element of the first source data right shifted by the amount specified by the second source data, with the vacated high order bits filled, if necessary, with the sign value; and

packet right shift logic means for performing one or more packet right shift logic type operations, each of which causes the result packet data to contain, as independent result elements, each data element of the first source data right shifted by the amount specified by the second source data, with the vacated high order bits filled, if necessary, with 0's.

15. The system of any of claims 9-10 or 12-14, wherein the storage device stores a complex multiplication routine comprising instructions to specify one of the one or more multiply-add type operations to multiply at least two complex numbers with each other.
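
By way of non-limiting illustration, one way a single multiply-add type operation can multiply two complex numbers is sketched below. The arrangement of the two sources, the use of unbounded Python integers in place of fixed-width elements, and the function names are assumptions made for the example; the claim does not prescribe them.

```python
def multiply_add(a, b):
    """Sum adjacent products: [a0*b0 + a1*b1, a2*b2 + a3*b3]."""
    return [a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3]]


def complex_multiply(x, y):
    """Multiply x = (xr, xi) by y = (yr, yi) using one multiply-add."""
    xr, xi = x
    yr, yi = y
    src1 = [xr, xi, xr, xi]     # first source: x repeated twice
    src2 = [yr, -yi, yi, yr]    # second source: y rearranged, one sign flipped
    # First sum:  xr*yr - xi*yi  (real part)
    # Second sum: xr*yi + xi*yr  (imaginary part)
    real, imag = multiply_add(src1, src2)
    return real, imag
```

The two result elements are kept separate rather than summed, which is what allows them to serve as the real and imaginary parts.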

16. The system of any of claims 9-10 or 12-14, wherein the storage device stores a multiply-accumulate routine that includes at least one type of instruction for specifying each of the decompose and multiply-add type operations.

17. The system of any of claims 9-10 or 12-14, wherein the storage device stores a dot product routine comprising instructions for specifying at least one type of each of the multiply-add, packet shift, and packet add type operations.
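
By way of non-limiting illustration, a dot product of two four-element vectors may combine the three operation types of claim 17 as sketched below. The sketch assumes signed 16-bit inputs, two 32-bit sums packed into a 64-bit value, and a shift of the whole 64-bit value by 32 bits; the function names are hypothetical.

```python
M32 = (1 << 32) - 1


def multiply_add(a, b):
    """a, b: four signed 16-bit values each; returns two packed 32-bit sums."""
    lo = (a[0] * b[0] + a[1] * b[1]) & M32
    hi = (a[2] * b[2] + a[3] * b[3]) & M32
    return (hi << 32) | lo


def packed_add32(x, y):
    """Add the two 32-bit elements of x and y independently."""
    lo = ((x & M32) + (y & M32)) & M32
    hi = ((x >> 32) + (y >> 32)) & M32
    return (hi << 32) | lo


def dot_product(a, b):
    sums = multiply_add(a, b)             # upper sum | lower sum
    shifted = sums >> 32                  # move the upper sum to the low slot
    total = packed_add32(sums, shifted)   # low element now holds both sums
    low = total & M32
    # Interpret the low 32-bit element as a signed value.
    return low - (1 << 32) if low & (1 << 31) else low
```

Longer vectors could be handled by accumulating the packed sums across iterations and performing the shift and final add once at the end.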

18. The system of any of claims 9-10 or 12-14, wherein the storage device stores a loop filter routine comprising instructions for specifying at least one type of each of the split, packet shift, and packet add type operations.

19. The system of any of claims 9-10 or 12-14, wherein the storage device stores a motion estimation routine comprising instructions for specifying at least one type of each of the decomposition, packet subtract with saturation, packet add, and packet logic type operations, the motion estimation routine processing data displayable on the display device.

20. The system of any of claims 9-10 or 12-14, wherein the storage device stores a discrete cosine transform routine comprising at least one type of instruction to specify each of the packet shift and multiply-add type operations, the discrete cosine transform routine processing data displayable on the display device.

21. A computer-implemented method to perform operations of a multimedia application, the method comprising:

receiving a plurality of packet data commands, each packet data command specifying an operation to be performed on first source data and second source data identified by the command; and

performing the operations specified by each of the plurality of packet data commands, wherein performance of the operations specified by the plurality of packet data commands includes at least:

in response to receipt of any of one or more split type operations, performing generation and storage of result packed data in a storage area, the result packed data including less than all of the data elements of the first source data interleaved with corresponding data elements of the second source data;

in response to receipt of any of one or more pack-type operations, performing in the memory area generation and storage of result packed data comprising a plurality of bit portions derived from each data element in the first source data and the second source data, the plurality of bit portions from the first source data being adjacent to each other in the result packed data and the plurality of bit portions from the second source data being adjacent to each other in the result packed data;

in response to receipt of any of the one or more population count type operations, performing generation and storage of result packed data in the storage area, the result packed data including at least first and second result data elements, the first result data element representing a total number of bits set in a first data element of the first source data, the second result data element representing a total number of bits set in a second data element of the first source data;

in response to receipt of any one of one or more packet multiply type operations, performing in the memory area generation and storage of result packed data including only high order or low order bits as independent result elements from a result of multiplying each data element of the first source data by a corresponding data element of the second source data;

in response to receipt of any one of one or more packet-add-type operations, performing in the memory area generation and storage of result packet data, the result packet data including, as independent result elements, each data element of the first source data added with a corresponding data element of the second source data;

in response to receipt of any of one or more packet-subtract-type operations, performing in the memory area generation and storage of result packet data comprising each data element of the second source data subtracted from a corresponding data element of the first source data as an independent result element;

in response to receipt of any of one or more packet comparison type operations, performing in the memory area generation and storage of result packet data, the result packet data including, as independent result elements, masks each indicating the result of comparing a data element in the first source data with the corresponding data element in the second source data, each mask corresponding to a true comparison comprising a plurality of bits each having a first predetermined value, and each mask corresponding to a false comparison comprising a plurality of bits each having a second predetermined value;

in response to receipt of any one of one or more multiply-add type operations, performing in the memory area generation and storage of result packed data, the result packed data including a first result element and a second result element that are stored without being summed with each other, wherein the first result element represents a first sum of the results of multiplying two pairs of corresponding data elements of the first source data and the second source data, the second result element represents a second sum of the results of multiplying two different pairs of corresponding data elements of the first source data and the second source data, and the first and second result elements have a higher precision than each of the data elements of the first and second source data used to generate them; and

in response to receipt of any of one or more packet-shifting type operations, performing in the memory area generation and storage of result packed data comprising, as independent result elements, each data element of the first source data shifted by an amount specified by the second source data and, if necessary, padded by that number of bits.
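
By way of non-limiting illustration, the population count and packet multiply steps of claim 21 may be modelled as below. The sketch assumes four 16-bit elements per 64-bit value and an unsigned multiplication; the claim fixes neither, and the function names are hypothetical.

```python
ELEM_BITS = 16    # assumed element width
ELEM_COUNT = 4    # assumed number of elements in a 64-bit value
MASK = (1 << ELEM_BITS) - 1


def split(packed):
    """Split a packed value into its elements, lowest element first."""
    return [(packed >> (i * ELEM_BITS)) & MASK for i in range(ELEM_COUNT)]


def join(elems):
    """Join elements (lowest first) back into one packed value."""
    packed = 0
    for i, e in enumerate(elems):
        packed |= (e & MASK) << (i * ELEM_BITS)
    return packed


def population_count(a):
    # Each result element holds the number of 1 bits in the matching
    # element of the source data.
    return join([bin(e).count("1") for e in split(a)])


def packed_multiply(a, b, high=False):
    # Assumed unsigned multiply; only the high order or the low order
    # half of each full-width product is kept.
    out = []
    for x, y in zip(split(a), split(b)):
        product = x * y
        out.append(product >> ELEM_BITS if high else product & MASK)
    return join(out)
```

Calling `packed_multiply` once with `high=False` and once with `high=True` yields both halves of every product, which together give the full-width results.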

22. The method of claim 21, wherein for at least one of the one or more split type operations, the resulting packet data includes upper or lower half bits of the first source data and the second source data.

23. The method of claim 21, wherein for at least one of the one or more pack-type operations, each data element of the first and second source data is an N-bit data element, and each of the plurality of bit portions is an N/2-bit result element.

24. The method of any one of claims 21-23, wherein, for at least one of the one or more multiply-add type operations, each data element in the first and second source data comprises an N-bit data element, and each of the first result element and the second result element is a 2N-bit result element.

25. The method of any of claims 21-23, wherein the storage area is a set of one or more registers.

26. The method of claim 21, wherein each of said plurality of packet data commands specifies one memory location as a source and destination operand and another memory location as a source operand.

27. The method of claim 26, wherein each of said source operand and said source and destination operands is specified as a register number, said source operand being specified by bits 0 through 2 of an opcode byte, said source and destination operands being specified by bits 3 through 5 of said opcode byte.

28. The method of any of claims 21-23 or 26-27, wherein, in response to receipt of any of the one or more packet shifting type operations, performing the operation comprises:

in response to receipt of any of one or more packet right shift arithmetic type operations, performing generation and storage of the result packet data to contain, as independent result elements, each data element of the first source data right shifted by the amount specified by the second source data, with the vacated high order bits filled, if necessary, with the sign value;

in response to receipt of any of one or more packet left shift type operations, performing generation and storage of the result packet data to contain, as independent result elements, each data element of the first source data left shifted by the amount specified by the second source data, with the vacated low order bits filled, if necessary, with 0's; and

in response to receipt of any of one or more packet right shift logical type operations, performing generation and storage of the result packet data to contain, as independent result elements, each data element of the first source data right shifted by the amount specified by the second source data, with the vacated high order bits filled, if necessary, with 0's.