CN1672128A - Method and apparatus for accessing multiple vector elements in parallel

Info

Publication number: CN1672128A
Application number: CN 03817860
Authority: CN
Inventors: A·A·M·范维尔
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2002-07-26
Filing date: 2003-07-10
Publication date: 2005-09-21
Also published as: EP1527385A1; JP2005534120A; WO2004013752A1; AU2003281792A1

Abstract

Vector processing is a suitable technique for applications with a large computational demand. Vector processors provide high-level operations that work on vectors (i.e., linear arrays of numbers). Vector operations can be made faster than a sequence of scalar operations on the same number of data items. A typical application area for vector processing is audio and video processing. Vector memory systems have a large data width, which allows a complete vector of data elements to be retrieved in one memory access using a single memory address. These data elements can then be processed in parallel. However, when using a vector memory system, problems of vector alignment and of ordering a set of vector elements may occur. The present invention provides an improved method for vector alignment and ordering of vector elements in a computer system comprising a processor (PROC) and a multi-port memory (MEM), which results in better performance. First, a memory base address is passed to an address configuration unit (ACU). Next, the address configuration unit (ACU) defines a set of memory addresses using the memory base address and a configuration instruction for configuring the address configuration unit. Finally, the vector is transferred to or from the multi-port memory (MEM) using the set of memory addresses.

Description

Method and apparatus for accessing multiple vector elements in parallel

Technical field

The invention relates to a computer system comprising:

a processor;

a multi-port memory that is accessible by the processor.

The invention further relates to a method for transferring vectors in said computer system.

Furthermore, the invention relates to a computer program for implementing said method.

Background art

Vector processing is a suitable technique for applications with heavy computational demands. Vector processors provide high-level operations that work on vectors, that is, linear arrays of numbers. A vector processor pipelines the operations on the individual elements of a vector. The pipeline includes not only the arithmetic operations, but also memory accesses and effective address calculations. In addition, most high-end vector processors allow multiple operations to be performed simultaneously, creating parallelism among the operations on different elements. Vector instructions have several important properties. First, the computation of each result is independent of the computation of previous results, which allows a very deep pipeline without generating any data hazards. Second, a vector instruction is equivalent to executing an entire loop, which reduces the instruction bandwidth requirement. Third, the overhead of memory access is reduced because a complete vector is retrieved in a single access instead of individual data elements. For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items. A typical application area for vector processing is audio and video processing.

Vector memory systems have a large data width, which allows a complete vector of data elements to be retrieved in one memory access using a single memory address. These data elements can then be processed in parallel. However, several problems can occur when retrieving data from a vector memory system. First, the problem of vector alignment arises when data that straddles a vector boundary has to be read from the vector memory system. In that case, the data can be retrieved by requesting the contents of two memory addresses (i.e., two vectors) and then moving the requested data into a new vector. Second, a problem arises when a set of vector elements is required in an order that differs from the order in which the elements are stored. If a vector is required that holds an ordered set of elements stored in different vectors, the contents of these vectors must be retrieved, which requires at least two memory accesses followed by a selection of the appropriate data elements.

US Patent 5,933,650 describes a method for the alignment and ordering of vector elements. For the alignment of vector elements, one vector is loaded from a memory location into a first register and another vector is loaded from a memory location into a second register. A starting byte is determined that specifies the first byte of the aligned vector. Next, a vector is extracted from the first register and the second register, beginning at the first bit of the first byte in the first register and continuing through the bits of the second register. Finally, the extracted vector is copied into a third register, so that the third register contains the aligned elements for vector processing. For the ordering of vector elements, a first vector is loaded from a memory location into a first register and a second vector is loaded from a memory location into a second register. Then a subset of elements is selected from the first register and the second register. The elements of the subset are then copied into the elements of a third register, in a particular order suitable for subsequent vector processing.
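To make the cost of this kind of approach concrete, the following is a minimal Python sketch of reading a vector that straddles a vector boundary by means of two full-vector loads. It is an illustration only and not the implementation of US 5,933,650: the vector width of four elements, the list-based memory model and the names `load_vector` and `unaligned_read_two_loads` are assumptions.

```python
VECTOR_WIDTH = 4  # elements per vector (assumed for illustration)

def load_vector(memory, vector_index):
    """One wide memory access: returns the whole vector at a vector-aligned position."""
    start = vector_index * VECTOR_WIDTH
    return memory[start:start + VECTOR_WIDTH]

def unaligned_read_two_loads(memory, start):
    """Read VECTOR_WIDTH elements beginning at an arbitrary element index.

    Returns the aligned result and the number of memory accesses used.
    """
    first_index, shift = divmod(start, VECTOR_WIDTH)
    reg1 = load_vector(memory, first_index)       # memory access 1
    reg2 = load_vector(memory, first_index + 1)   # memory access 2
    combined = reg1 + reg2                        # view both registers as one run
    reg3 = combined[shift:shift + VECTOR_WIDTH]   # extract into a third register
    return reg3, 2

memory = list(range(16))
vector, accesses = unaligned_read_two_loads(memory, 2)
```

In this model the four elements starting at index 2 are obtained only after two memory accesses and one extra register for the result.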

A disadvantage of the prior-art method for the alignment and ordering of vector elements is that more than one read access to the vector memory system is required, which increases the overhead of obtaining the vector data. In addition, extra hardware is required, for example for temporarily storing the vectors from which elements have to be selected for vector alignment or vector ordering.

Summary of the invention

It is an object of the invention to provide an improved method for vector alignment and ordering of vector elements, which results in better performance of the vector processor.

This object is achieved with a method for transferring vectors, characterized in that the method comprises the following steps:

passing a memory base address to address configuration means;

defining a set of memory addresses by the address configuration means, using the memory base address and a configuration instruction for configuring the address configuration means;

transferring the vector to or from the multi-port memory using the set of memory addresses.

The method allows a complete vector to be transferred to or from the multi-port memory using a single memory base address. The data elements of the vector can be transferred to or from arbitrary locations in the memory, which increases flexibility and avoids the problems associated with vector alignment and the ordering of vector elements. Furthermore, using the multi-port memory in combination with the address configuration means reduces the instruction width: a complete vector can be transferred using a single memory base address, whereas otherwise every memory address used by the multi-port memory would have to be present in the instruction. For certain types of processors, such as very long instruction word (VLIW) processors, reducing code size is an important issue.
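The effect on instruction width can be illustrated with a small calculation. The sketch below is a simplified model, not a figure taken from the patent: the 16-bit address width is an assumption, the port count of four follows the embodiment of Fig. 2, and the function name `address_field_bits` is hypothetical.

```python
def address_field_bits(num_ports, address_bits, uses_acu):
    """Address bits that one access instruction must carry (illustrative model)."""
    if uses_acu:
        # Only the memory base address travels in the instruction;
        # the per-port offsets are held by the address configuration means.
        return address_bits
    # Without address configuration means, one explicit address per port.
    return num_ports * address_bits

plain = address_field_bits(num_ports=4, address_bits=16, uses_acu=False)
with_acu = address_field_bits(num_ports=4, address_bits=16, uses_acu=True)
```

Under these assumptions the address field shrinks from 64 bits to 16 bits per access instruction. The configuration instruction still has to convey the offsets, but it is only needed when the access pattern changes.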

According to the invention, a computer system is characterized in that the computer system further comprises address configuration means, wherein the address configuration means are designed to define a set of memory addresses using a memory base address and a configuration instruction for configuring the address configuration means, and wherein the multi-port memory is designed to use the set of memory addresses. A complete vector can be transferred to or from the multi-port memory using one memory base address, which reduces the memory overhead and improves the performance of the computer system.

Preferred embodiments of the invention are defined in the dependent claims. A computer program for implementing the method for transferring vectors according to the invention is defined in claim 8.

An embodiment of the computer system according to the invention is characterized in that:

the address configuration means comprise a plurality of register files arranged to be configured by the configuration instruction, and a plurality of address calculation units for calculating the set of memory addresses;

the register files are accessible by the address calculation units;

the address calculation units are coupled to the multi-port memory.

The configuration instruction configures the plurality of register files, and the register files hold this configuration until the next configuration instruction is executed. In between, for example while a loop of instructions is being executed, the configuration can be used repeatedly.
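The reuse of one configuration across a loop can be sketched as follows. This is a toy Python model under stated assumptions: the class `AddressConfigurationUnit`, its method names and the `config_count` counter are hypothetical, and each register file is reduced to a single stored offset.

```python
class AddressConfigurationUnit:
    """Toy model: one stored offset per memory port (all names are hypothetical)."""

    def __init__(self, num_ports=4):
        self.offsets = [0] * num_ports
        self.config_count = 0  # number of configuration instructions executed

    def configure(self, offsets):
        """Models a configuration instruction loading the register files."""
        self.offsets = list(offsets)
        self.config_count += 1

    def addresses(self, base):
        """Models the address calculation units: base plus offset for each port."""
        return [base + offset for offset in self.offsets]

acu = AddressConfigurationUnit()
acu.configure([0, 8, 16, 24])  # issued once, before the loop
# Loop body: only the base address changes from one iteration to the next.
address_sets = [acu.addresses(base) for base in range(3)]
```

Here three different address sets are produced while only one configuration instruction is executed.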

An embodiment of the computer system according to the invention is characterized in that the configuration instruction comprises a set of offsets, each offset in combination with the memory base address defining a second memory address. The set of offsets can be loaded directly into the plurality of register files and used by the plurality of address calculation units, which improves the performance of the address configuration means.

Brief description of the drawings

The features of the described embodiments will be further elucidated and described with reference to the drawings:

Fig. 1 shows a schematic diagram of a computer system according to the invention.

Fig. 2 shows a schematic diagram of a memory system with a multi-port memory and address configuration means.

Detailed description

Fig. 1 shows a block diagram of a computer system comprising a processor PROC, an address configuration unit ACU, a multi-port memory MEM and a system bus SB. The processor PROC, the address configuration unit ACU and the multi-port memory MEM are coupled to one another via the system bus SB. During the execution of instructions, the processor PROC can issue operations that access the multi-port memory MEM in order to read or write a vector of data elements.

Before a set of data elements is read from or written to the multi-port memory MEM, the address configuration unit ACU should be configured by a configuration instruction issued by the processor PROC. The configuration instruction configures the address configuration unit ACU so that it can use a memory base address to calculate the set of memory addresses specific to the set of data elements to be retrieved from the multi-port memory MEM. The configuration of the address configuration unit ACU remains unchanged until the next configuration instruction is issued.

After the address configuration unit ACU has been configured, the processor issues a read operation that includes a memory base address, and this memory base address is sent to the address configuration unit ACU. The address configuration unit ACU then calculates a set of memory addresses. These memory addresses are sent via the system bus SB to the multi-port memory MEM, after which the data elements are read from the multi-port memory MEM. The data elements are sent to the processor PROC as a single vector and are available for further processing.

If the processor PROC issues a write operation, the memory base address is sent to the address configuration unit ACU. The address configuration unit ACU calculates a set of memory addresses and sends this set via the system bus SB to the multi-port memory MEM. The data elements are also sent to the multi-port memory MEM via the system bus SB. In the next step, the data elements are written into the multi-port memory MEM.

Before the next write or read operation, a new configuration instruction may have to be issued, depending on the set of memory addresses required. For example, if the set of data elements to be read requires the same set of memory addresses and the same memory base address is applied, the configuration instruction does not have to be repeated. A new configuration instruction is also unnecessary when a different memory base address is used but the required configuration of the address configuration unit ACU remains the same.

Fig. 2 shows a block diagram of an embodiment of a memory system MS comprising a multi-port memory MEM and an address configuration unit ACU. The multi-port memory MEM comprises a RAM memory, four data input ports DatIn, four address ports Addr and four data output ports DatOut. The address configuration unit ACU comprises an address input port AddrIn, four address calculation units AU, four register files RF and four data input ports DatIn. In this embodiment, the data input ports DatIn are shared by the address configuration unit ACU and the multi-port memory MEM. The address input port AddrIn is coupled to the address calculation units AU, and each address calculation unit AU is coupled to its corresponding address port Addr of the multi-port memory MEM. The data input ports DatIn are coupled to the register files RF. The register files RF are accessible by the address calculation units AU.

The multi-port memory MEM supports commands for reading and writing data. Using the address ports Addr, data can be read from the RAM memory via the data output ports DatOut. The four data elements read at the data output ports DatOut can be combined into one vector. A set of four data elements can be written into the multi-port memory via the data input ports DatIn, using the address ports Addr for memory addressing.

The address configuration unit ACU supports a configuration instruction that specifies a set of offsets relative to a memory base address. When the configuration instruction is executed, an offset value is written into each register file RF via the corresponding data input port DatIn. Subsequently, each address calculation unit AU fetches the offset value from its corresponding register file RF and stores this value internally.

If the processor PROC issues a read operation to the memory system MS, a memory base address is provided at the address input port AddrIn. The address calculation units AU take the value of the memory base address from the address input port AddrIn and add their respective offset values to it. The address calculation units AU send the resulting set of memory addresses to the corresponding address ports Addr, and a read command is then issued to the multi-port memory MEM. The resulting set of data elements is provided at the data output ports DatOut of the multi-port memory MEM.

The processor PROC can also issue a write operation to the memory system MS in order to write a set of data elements into the RAM memory. The address input port AddrIn receives the memory base address. The address calculation units AU calculate a set of memory addresses using the memory base address and their respective offset values. The resulting set of memory addresses is sent to the corresponding address ports Addr of the multi-port memory MEM. The data elements are sent to the data input ports DatIn of the multi-port memory MEM. Subsequently, a write command is issued to the multi-port memory MEM and the data elements are written into the RAM memory.
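The read and write paths of the memory system MS can be summarized in a short behavioural model. The Python sketch below is an assumption-laden simplification rather than the hardware described: the class `MemorySystem` and its method names are hypothetical, each register file RF holds a single offset, and the four ports are modelled as positions in a list.

```python
class MemorySystem:
    """Toy model of memory system MS: four AUs, four register files, one RAM."""

    NUM_PORTS = 4

    def __init__(self, size):
        self.ram = [0] * size
        self.register_files = [0] * self.NUM_PORTS  # one offset per RF

    def configure(self, dat_in):
        """Configuration instruction: DatIn carries one offset per register file."""
        self.register_files = list(dat_in)

    def _addresses(self, addr_in):
        """Each AU adds its own offset to the base address seen on AddrIn."""
        return [addr_in + offset for offset in self.register_files]

    def read(self, addr_in):
        """Read command: one data element per Addr port appears on DatOut."""
        return [self.ram[a] for a in self._addresses(addr_in)]

    def write(self, addr_in, dat_in):
        """Write command: DatIn elements are stored at the computed addresses."""
        for address, value in zip(self._addresses(addr_in), dat_in):
            self.ram[address] = value

ms = MemorySystem(size=16)
ms.configure([0, 2, 4, 6])
ms.write(1, [7, 8, 9, 10])  # stores to RAM addresses 1, 3, 5 and 7
dat_out = ms.read(1)        # reads the same four addresses back
```

A single base address of 1 is enough for both the write and the read; the stride of two between elements comes entirely from the configured offsets.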

In other embodiments, the configuration instruction may comprise a set of commands issued to the address calculation units AU for calculating the set of offsets.

With a suitable configuration instruction, the set of offsets received by the register files RF is combined with the memory base address such that the address calculation units AU can define an arbitrary set of memory addresses. Using this set of memory addresses, a set of data elements can be written to, or retrieved from, the multi-port memory MEM simultaneously. The memory system MS therefore acts as a vector memory system, with the advantage that a set of data elements can be retrieved from arbitrary memory locations using one memory base address. Furthermore, compared with a plain multi-port memory, the memory system MS has the advantage that a set of data elements can be addressed using one memory address, without requiring a set of memory addresses from an external source. As a result, the instruction width can be reduced, which is of particular interest for very long instruction word (VLIW) processors, where code size reduction is an important issue.
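How different offset sets address the alignment and ordering problems can be shown with a few access patterns. The following Python sketch models one parallel access as a list gather; the function names, the vector width of four and the memory contents are assumptions chosen for illustration.

```python
VECTOR_WIDTH = 4

def access_addresses(base, offsets):
    """Addresses used by one access: base address plus each configured offset."""
    return [base + offset for offset in offsets]

def read_vector(memory, base, offsets):
    """One parallel access on the multi-port memory, modelled as a list gather."""
    return [memory[a] for a in access_addresses(base, offsets)]

memory = list(range(100, 116))  # 16 elements, i.e. four stored vectors

# Alignment: four consecutive elements that straddle a vector boundary.
unaligned = read_vector(memory, base=2, offsets=[0, 1, 2, 3])

# Ordering: the same elements delivered in reverse order.
reversed_vector = read_vector(memory, base=2, offsets=[3, 2, 1, 0])

# Ordering across vectors: one element from each of the four stored vectors.
stride_offsets = [i * VECTOR_WIDTH for i in range(VECTOR_WIDTH)]
column = read_vector(memory, base=1, offsets=stride_offsets)
```

In each case the processor supplies one base address and the vector is obtained in a single access, without a second load or an intermediate register for selecting elements.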

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

1. A method for transferring a vector in a computer system, said computer system comprising:

a processor;

a multi-port memory accessible by the processor,

characterized in that the method comprises the following steps:

passing a memory base address to address configuration means;

defining a set of memory addresses by the address configuration means, using the memory base address and a configuration instruction for configuring the address configuration means;

transferring the vector to or from the multi-port memory using the set of memory addresses.

2. The method of claim 1, wherein:

the address configuration means comprise a plurality of register files arranged to be configured by the configuration instruction, and a plurality of address calculation units for calculating the set of memory addresses;

the register files are accessible by the address calculation units;

the address calculation units are coupled to the multi-port memory.

3. The method of claim 1, wherein:

the configuration instruction comprises a set of offsets, each offset in combination with the memory base address defining a second memory address.

4. A computer system comprising:

a processor;

a multi-port memory accessible by the processor,

characterized in that the computer system further comprises address configuration means, wherein the address configuration means are designed to define a set of memory addresses using a memory base address and a configuration instruction for configuring the address configuration means, and wherein the multi-port memory is designed to use the set of memory addresses.

5. The computer system of claim 4, wherein:

The register file is accessible by an address calculation unit;

The address calculation unit is coupled to a multi-port memory.

6. The computer system of claim 4, wherein:

7. The computer system according to claim 4, wherein: said multi-port memory and address configuration means are both included in a memory system.

8. A computer program comprising computer program code means for instructing a computer system to perform the steps of the method as claimed in claim 1.