HK1053728B - SRAM controller for parallel processor architecture
Description
Technical Field
The present invention relates to memory controllers, particularly for use in parallel processing systems.
Background
Parallel processing is an efficient form of information processing that emphasizes concurrent events in the computing process. Parallel processing demands concurrent execution of many programs in a computer, in contrast to sequential processing. In the context of a parallel processor, parallelism involves doing more than one thing at the same time. Unlike a serial paradigm, in which all tasks are performed sequentially at a single station, or a pipelined machine, in which tasks are performed at specialized stations, with parallel processing a plurality of stations are provided, each capable of performing all tasks. That is, in general, all or a plurality of the stations work simultaneously and independently on the same or common elements of a problem. Certain problems are well suited to being solved by applying parallel processing.
Memory systems used in parallel processing tasks can be inefficient: the memory system can exhibit dead time (i.e., bubbles), which may be 1 or 2 cycles depending on the type of memory device used.
Disclosure of Invention
According to one aspect of the present invention, a controller for a random access memory includes an address and command queue module that stores memory references from a plurality of micro-control functional units. The address and command queue module includes a read queue module and a first read/write queue module that holds memory references from a core processor. The controller also includes a control logic module including an arbiter that detects the fullness of each of the queues and a completion status of outstanding memory references in order to select a memory reference instruction from one of the queue modules.
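As a rough illustration of this claim, the arbiter's selection can be sketched in Python. This is a hypothetical model only: the queue names, the capacity, and the fullest-queue-first policy are assumptions chosen for illustration, not the patented logic.

```python
from collections import deque

class QueueArbiter:
    """Illustrative arbiter: inspects queue fullness and the count of
    outstanding references before granting the next memory reference."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        # hypothetical queue set loosely modeled on the description
        self.queues = {"read": deque(), "order": deque(), "readlock_miss": deque()}
        self.outstanding = 0  # references issued but not yet completed

    def enqueue(self, name, ref):
        q = self.queues[name]
        if len(q) >= self.capacity:
            raise OverflowError(f"{name} queue full")
        q.append(ref)

    def select(self):
        """Grant from the fullest non-empty queue; stall on back-pressure."""
        if self.outstanding >= self.capacity:
            return None  # too many outstanding references; wait
        candidates = [(len(q), name) for name, q in self.queues.items() if q]
        if not candidates:
            return None
        _, name = max(candidates)  # fullest queue wins this grant
        self.outstanding += 1
        return self.queues[name].popleft()

    def complete(self):
        self.outstanding -= 1
```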
One or more aspects of the invention may provide one or more of the following advantages.
The memory controller performs memory reference sorting to minimize delays (bubbles) in the pipeline from the interface to memory. The memory system is designed to be essentially flooded with independent memory requests. The memory controller can perform memory reference sorting, which reduces the dead time, or bubbles, that occur when accessing SRAM. For memory references to SRAM, switching the direction of current on the signal lines between reads and writes produces a bubble, or dead time, while waiting for current to settle on the conductors coupling the SRAM to the SRAM controller. That is, the drivers that drive current on the bus need to settle before changing state. Thus, repeated cycles of a read followed by a write degrade peak bandwidth. Memory reference sorting organizes references to memory so that long strings of reads can be followed by long strings of writes. This can be used to minimize dead time in the pipeline and effectively approach the maximum available bandwidth. Grouping reads and writes improves cycle time by eliminating dead cycles. The memory controller performs memory reference sorting based on the read memory references.
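The sorting idea can be sketched as follows. This is an illustrative model, not the patented algorithm: the one-cycle bus-turnaround cost and the simple read-run/write-run policy are assumptions made to show why grouping eliminates dead cycles.

```python
def order_references(refs, turnaround_cycles=1, access_cycles=1):
    """refs: list of ('read'|'write', addr) tuples.
    Groups reads together and writes together so the bus direction
    switches at most once. Returns (ordered_refs, total_cycles)."""
    reads = [r for r in refs if r[0] == "read"]
    writes = [r for r in refs if r[0] == "write"]
    ordered = reads + writes
    switches = 1 if reads and writes else 0
    cycles = len(ordered) * access_cycles + switches * turnaround_cycles
    return ordered, cycles

def unordered_cycles(refs, turnaround_cycles=1, access_cycles=1):
    """Cost of issuing refs as-is: every read/write alternation pays
    a turnaround bubble while the bus drivers settle."""
    switches = sum(1 for a, b in zip(refs, refs[1:]) if a[0] != b[0])
    return len(refs) * access_cycles + switches * turnaround_cycles
```

For a fully alternating request stream, each alternation pays the turnaround bubble, whereas the sorted stream pays it once.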
The memory controller may also include a lock lookup device for looking up read locks. The address and command queue module also includes a read lock miss queue module to hold read memory reference requests that fail due to a lock on a portion of memory, as determined by the lock lookup device.
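One way the read-lock path could behave is sketched below. The data structures and the retry-on-unlock policy are illustrative assumptions, not the claimed hardware.

```python
from collections import deque

class LockLookup:
    """Illustrative read-lock path: a lookup structure records locked
    addresses; a read-lock request that hits a held lock is parked in
    a read-lock miss queue and retried when the lock is released."""

    def __init__(self):
        self.locked = set()
        self.miss_queue = deque()

    def read_lock(self, thread, addr):
        if addr in self.locked:
            self.miss_queue.append((thread, addr))  # defer, do not fail
            return False
        self.locked.add(addr)
        return True

    def unlock(self, addr):
        self.locked.discard(addr)
        # retry deferred requests now that a lock was released
        retries, self.miss_queue = self.miss_queue, deque()
        granted = []
        for thread, a in retries:
            if self.read_lock(thread, a):
                granted.append(thread)
        return granted
```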
Brief description of the drawings
FIG. 1 is a block diagram of a communication system using a hardware-based multithreaded processor.
FIG. 2 is a detailed block diagram of the hardware-based multithreaded processor of FIG. 1.
FIG. 3 is a block diagram of a functional unit (microengine) used in the hardware-based multithreaded processor of FIGS. 1 and 2.
FIG. 3A is a block diagram of a pipeline in the microengine of FIG. 3.
FIG. 3B shows a format of a context switch instruction.
FIG. 3C is a block diagram showing the general-purpose register address arrangement.
FIG. 4 is a block diagram of a memory controller for enhanced-bandwidth operation used in the hardware-based multithreaded processor.
FIG. 4A is a flow chart illustrating an arbitration policy in the SDRAM controller of FIG. 4.
FIG. 4B is a timing diagram showing the advantages of optimizing the SDRAM controller.
FIG. 5 is a block diagram of a memory controller for latency-limited operations used in the hardware-based multithreaded processor.
FIG. 5A is a timing diagram showing the advantages of optimizing the SRAM controller.
FIG. 6 is a block diagram of a communication bus interface in the processor of FIG. 1.
Architecture:
Referring to FIG. 1, a communication system 10 includes a parallel, hardware-based multithreaded processor 12. The hardware-based multithreaded processor 12 is coupled to a bus (e.g., PCI bus 14), a memory system 16, and a second bus 18. The system 10 is particularly useful for tasks that can be broken into multiple parallel subtasks or functions. The hardware-based multithreaded processor 12 is particularly useful for bandwidth-oriented (rather than latency-oriented) tasks. The hardware-based multithreaded processor 12 has a plurality of microengines 22, each with a plurality of hardware-controlled threads that can simultaneously be active and independently work on a task.
The hardware-based multithreaded processor 12 also includes a central controller 20 that assists in loading microcode control for the other resources of the hardware-based multithreaded processor 12 and performs other general-purpose computer type functions, such as handling protocols and exceptions, and extra support for packet processing where the microengines pass the packets off for more detailed processing, such as in boundary conditions. In one embodiment, the processor 20 is a StrongARM-based architecture (ARM is a trademark of ARM Limited of the United Kingdom). The general-purpose microprocessor 20 has an operating system. Through the operating system, the processor 20 can call functions to operate on the microengines 22a-22f. The processor 20 can use any supported operating system, preferably a real-time operating system. For the core processor implemented as a StrongARM architecture, operating systems such as Microsoft NT Real-Time, VxWorks, and μC/OS (a freeware operating system available over the Internet) can be used.
The hardware-based multithreaded processor 12 also includes a plurality of functional microengines 22a-22f. Each functional microengine 22a-22f maintains, in hardware, a plurality of program counters and states associated with the program counters. Effectively, a corresponding number of sets of threads can be simultaneously active on each of the microengines 22a-22f, although only one thread is actually operating at any one time.
In one embodiment, six microengines 22a-22f are shown. Each microengine 22a-22f has the capability to process four hardware threads. The six microengines 22a-22f operate with shared resources, including the memory system 16 and the bus interfaces 24 and 28. The memory system 16 includes a synchronous dynamic random access memory (SDRAM) controller 26a and a static random access memory (SRAM) controller 26b. The SDRAM memory 16a and SDRAM controller 26a are typically used for processing large volumes of data (e.g., processing network payloads from network packets). The SRAM controller 26b and SRAM memory 16b are used in a networking implementation for low-latency, fast-access tasks (e.g., accessing lookup tables, memory for the core processor 20, and so forth).
The six microengines 22a-22f access either the SDRAM 16a or the SRAM 16b based on characteristics of the data. Thus, low-latency, low-bandwidth data is stored in and fetched from SRAM, whereas higher-bandwidth data for which latency is not as important is stored in and fetched from SDRAM. The microengines 22a-22f can execute memory reference instructions to either the SDRAM controller 26a or the SRAM controller 26b.
The advantages of hardware multithreading can be explained by SRAM or SDRAM memory accesses. As an example, an SRAM access requested by a thread_0 from a microengine will cause the SRAM controller 26b to initiate an access to the SRAM memory 16b. The SRAM controller controls arbitration for the SRAM bus, accesses the SRAM 16b, fetches the data from the SRAM 16b, and returns the data to the requesting microengine 22a-22f. During an SRAM access, if the microengine (e.g., 22a) had only a single thread that could operate, that microengine would be dormant until data was returned from the SRAM. By employing hardware context swapping within each of the microengines 22a-22f, other contexts with unique program counters can execute in that same microengine. Thus, another thread (e.g., thread_1) can function while the first thread (e.g., thread_0) is awaiting the read data to return. During execution, thread_1 may access the SDRAM memory 16a. While thread_1 operates on the SDRAM unit and thread_0 is operating on the SRAM unit, a new thread (e.g., thread_2) can now operate in the microengine 22a. thread_2 can operate for a certain amount of time until it needs to access memory or perform some other long-latency operation, such as making an access to a bus interface. Therefore, the processor 12 can simultaneously have a bus operation, an SRAM operation, and an SDRAM operation all being completed or operated upon by one microengine 22a, and have one more thread available to process more work in the data path.
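The latency-hiding benefit can be quantified with a toy model. The cycle counts below are arbitrary assumptions; the point is only that with context swapping the memory latencies overlap the compute phases of the other threads, whereas a single blocked thread pays each latency in full.

```python
def total_cycles(n_threads, compute=2, mem_latency=6, swap=True):
    """Toy model: each thread computes for `compute` cycles, then issues
    one memory reference that takes `mem_latency` cycles to return."""
    if not swap:
        # a single-context engine blocks on each reference in turn
        return n_threads * (compute + mem_latency)
    # with context swapping, thread i computes while the references of
    # threads 0..i-1 are outstanding; the last reference is issued at
    # n_threads * compute and returns mem_latency cycles later
    return n_threads * compute + mem_latency
```

With four threads, two compute cycles, and a six-cycle memory latency, swapping finishes in 14 cycles versus 32 for the blocking case.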
Hardware context swapping also synchronizes completion of tasks. For example, two threads could hit the same shared resource (e.g., SRAM). Each one of the separate functional units (e.g., the FBUS interface 28, SRAM controller 26b, and SDRAM controller 26a) reports back a flag signaling completion of an operation when a requested task from one of the microengine thread contexts completes. When the microengine receives the flag, the microengine can determine which thread to turn on.
One example of an application for the hardware-based multithreaded processor 12 is as a network processor. As a network processor, the hardware-based multithreaded processor 12 interfaces to network devices such as media access controller devices (e.g., a 10/100BaseT octal MAC 13a or a gigabit Ethernet device 13b). In general, as a network processor, the hardware-based multithreaded processor 12 can interface to any type of communication device or interface that receives/sends large amounts of data. The communication system 10, functioning in a networking application, can receive a plurality of network packets from the devices 13a and 13b and process those packets in a parallel manner. With the hardware-based multithreaded processor 12, each network packet can be independently processed.
Another example of use of the processor 12 is as a print engine for a postscript processor, or as a processor for a storage subsystem (i.e., RAID disk storage). A further use is as a matching engine. In the securities industry, for example, the advent of electronic trading requires the use of electronic matching engines to match orders between buyers and sellers. These and other parallel types of tasks can be accomplished on the system 10.
The processor 12 includes a bus interface 28 that couples the processor to the second bus 18. The bus interface 28 in one embodiment couples the processor 12 to the so-called FBUS 18 (FIFO bus). The FBUS interface 28 is responsible for controlling and interfacing the processor 12 to the FBUS 18. The FBUS 18 is a 64-bit-wide FIFO bus used to interface to media access controller (MAC) devices.
The processor 12 includes a second interface (e.g., a PCI bus interface 24) that couples other system components residing on the PCI bus 14 to the processor 12. The PCI bus interface 24 provides a high-speed data path 24a to the memory 16 (e.g., the SDRAM memory 16a). Through this path, data can be moved quickly from the SDRAM 16a through the PCI bus 14 via direct memory access (DMA) transfers. The hardware-based multithreaded processor 12 supports image transfers. The hardware-based multithreaded processor 12 can employ a plurality of DMA channels, so if one target of a DMA transfer is busy, another of the DMA channels can take over the PCI bus to deliver information to another target, thereby maintaining high processor 12 efficiency. Additionally, the PCI bus interface 24 supports target and master operations. Target operations are operations where slave devices on the bus 14 access the SDRAM through reads and writes that are serviced as a slave to the target operation. In master operations, the processor core 20 sends data directly to or receives data directly from the PCI interface 24.
Each of the functional units is coupled to one or more internal buses. As described below, the internal buses are dual 32-bit buses (i.e., one bus for read operations and one bus for write operations). The hardware-based multithreaded processor 12 is also constructed such that the sum of the bandwidths of the internal buses in the processor 12 exceeds the bandwidth of the external buses coupled to the processor 12. The processor 12 includes an internal core processor bus 32 (e.g., an ASB bus (Advanced System Bus)) that couples the processor core 20 to the memory controllers 26a and 26b and to an ASB translator 30, described below. The ASB bus is a subset of the so-called AMBA bus that is used with the StrongARM processor core. The processor 12 also includes a private bus 34 that couples the microengine units to the SRAM controller 26b, the ASB translator 30, and the FBUS interface 28. A memory bus 38 couples the memory controllers 26a and 26b to the bus interfaces 24 and 28 and the memory system 16, including a flash read-only memory (flash ROM) 16c used for boot operations and the like.
Referring to FIG. 2, each of the microengines 22a-22f includes an arbiter that examines flags to determine the available threads to be operated upon. Any thread from any of the microengines 22a-22f can access the SDRAM controller 26a, the SRAM controller 26b, or the FBUS interface 28. The memory controllers 26a and 26b each include a plurality of queues to store outstanding memory reference requests. The queues either maintain the order of memory references or arrange memory references to optimize memory bandwidth. For example, if a thread_0 has no dependencies or relationship to a thread_1, there is no reason that threads 1 and 0 cannot complete their memory references to the SRAM unit out of order. The microengines 22a-22f issue memory reference requests to the memory controllers 26a and 26b. The microengines 22a-22f flood the memory subsystems 26a and 26b with enough memory reference operations such that the memory subsystems 26a and 26b become the bottleneck for processor 12 operation.
If the memory subsystem 16 is flooded with memory requests that are largely independent, the processor 12 can perform memory reference sorting. Memory reference sorting improves the achievable memory bandwidth. As described below, memory reference sorting reduces the dead time, or bubbles, that occur when accessing SRAM. For memory references to SRAM, switching the direction of current on the signal lines between reads and writes produces a bubble, or dead time, while waiting for current to settle on the conductors coupling the SRAM 16b to the SRAM controller 26b.
That is, the drivers that drive current on the bus need to settle before changing state. Thus, repeated cycles of a read followed by a write degrade peak bandwidth. Memory reference sorting allows the processor 12 to organize references to memory so that long strings of reads can be followed by long strings of writes. This can be used to minimize dead time in the pipeline and effectively approach the maximum available bandwidth. Reference sorting helps maintain parallel hardware context threads. On the SDRAM, reference sorting allows the precharge of one bank to be hidden behind access to another bank. Specifically, if the memory system 16a is organized into an odd bank and an even bank, the memory controller can start precharging the even bank while the processor is operating on the odd bank. Precharges are possible when the memory references alternate between odd and even banks. By ordering memory references into alternating accesses to opposite banks, the processor 12 improves SDRAM bandwidth. Additionally, other optimizations can be used, such as merge optimization (where operations that can be merged are merged prior to memory access), open-page optimization (where, by examining addresses, an opened page of memory is not reopened), chaining (described below), and refresh mechanisms.
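The odd/even bank alternation can be sketched as follows. The bank-select bit position and the one-cycle precharge cost are assumptions for illustration; the point is that alternating banks removes the same-bank back-to-back accesses that expose precharge time.

```python
def bank(addr, bank_bit=1):
    """Assumed bank select: one address bit picks even (0) or odd (1) bank."""
    return (addr >> bank_bit) & 1

def interleave_banks(refs, bank_bit=1):
    """Reorder addresses so accesses alternate even-bank / odd-bank."""
    even = [a for a in refs if bank(a, bank_bit) == 0]
    odd = [a for a in refs if bank(a, bank_bit) == 1]
    out = []
    while even or odd:
        if even:
            out.append(even.pop(0))
        if odd:
            out.append(odd.pop(0))
    return out

def precharge_stalls(refs, bank_bit=1, precharge=1):
    """Count exposed precharge cycles: back-to-back references that hit
    the same bank cannot hide the precharge behind the other bank."""
    return sum(precharge for a, b in zip(refs, refs[1:])
               if bank(a, bank_bit) == bank(b, bank_bit))
```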
The FBUS interface 28 supports "transmit" and "receive" flags for each port that a MAC device supports, along with an "interrupt" flag indicating when service is warranted. The FBUS interface 28 also includes a controller 28a that performs header processing of incoming packets from the FBUS 18. The controller 28a extracts the packet headers and performs a microprogrammable source/destination/protocol hashed lookup in SRAM (used for address resolution). If the hash does not successfully resolve, the packet header is sent to the processor core 20 for additional processing. The FBUS interface 28 supports the following internal data transactions:
FBUS cells (shared bus SRAM) to/from the microengines.
The FBUS unit writes from the SDRAM unit (via a dedicated bus).
The FBUS unit reads to SDRAM (via the M bus).
The FBUS 18 is a standard industry bus that includes a data bus (e.g., 64 bits wide) and sideband control for address and read/write control. The FBUS interface 28 provides the ability to input large amounts of data using a series of input and output FIFOs 29a-29b. From the FIFOs 29a-29b, the microengines 22a-22f either fetch data or instruct the SDRAM controller 26a to move data from a receive FIFO (containing data that came from devices on the bus 18) into the FBUS interface 28. The data can be sent through the memory controller 26a to the SDRAM memory 16a via direct memory access. Similarly, the microengines can move data from the SDRAM 16a to the interface 28 and out to the FBUS 18 via the FBUS interface 28.
Data functions are distributed among the microengines. Connectivity to the SRAM controller 26b, SDRAM controller 26a, and FBUS interface 28 is via command requests. A command request can be a memory request or an FBUS request. For example, a command request can move data from a register located in a microengine 22a to a shared resource (e.g., an SDRAM location, an SRAM location, flash memory, or some MAC address). The commands are sent out to each of the functional units and the shared resources. However, the shared resources do not need to maintain local buffering of the data. Rather, the shared resources access distributed data located inside the microengines. This enables the microengines 22a-22f to have local access to data rather than arbitrating for access on a bus and risking contention for the bus. With this feature, there is a 0-cycle stall for waiting on data internal to the microengines 22a-22f.
The data buses (e.g., the ASB bus 32, SRAM bus 34, and SDRAM bus 38) coupling the shared resources (e.g., the memory controllers 26a and 26b) are of sufficient bandwidth such that there are no internal bottlenecks. Thus, in order to avoid bottlenecks, the processor 12 has a bandwidth requirement where each of the functional units is provided with at least twice the maximum bandwidth of the internal buses. As an example, the SDRAM can run a 64-bit-wide bus at 83 MHz. The SRAM data bus can have separate read and write buses (e.g., a 32-bit-wide read bus running at 166 MHz and a 32-bit-wide write bus running at 166 MHz). That is, in essence, 64 bits running at 166 MHz, which is effectively twice the bandwidth of the SDRAM.
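The bandwidth claim can be checked with simple arithmetic, using the figures quoted above:

```python
def bandwidth_mbps(width_bits, mhz):
    """Peak bandwidth in megabytes per second for a bus of the given
    width clocked at the given frequency (one transfer per cycle)."""
    return width_bits * mhz / 8

# separate SRAM read and write buses, each 32 bits at 166 MHz
sram = bandwidth_mbps(32, 166) + bandwidth_mbps(32, 166)
# single SDRAM bus, 64 bits at 83 MHz
sdram = bandwidth_mbps(64, 83)
```

The two SRAM buses together deliver 1328 MB/s against 664 MB/s for the SDRAM bus, i.e., exactly twice the bandwidth, as the text states.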
The core processor 20 can also access the shared resources. The core processor 20 has direct communication to the SDRAM controller 26a, the bus interface 24, and the SRAM controller 26b via the bus 32. However, to access the microengines 22a-22f and the transfer registers located at any of the microengines 22a-22f, the core processor 20 accesses the microengines 22a-22f via the ASB translator 30 over the bus 34. The ASB translator 30 can physically reside in the FBUS interface 28, but is logically distinct. The ASB translator 30 performs address translation between FBUS microengine transfer register locations and core processor addresses (i.e., the ASB bus) so that the core processor 20 can access registers belonging to the microengines 22a-22f.
Although the microengines 22 may use register sets to exchange data as described below, a scratch pad 27 is also provided to allow the microengines to write data out to memory for other microengines to read. The scratch pad memory 27 is coupled to the bus 34.
The processor core 20 includes a RISC core 50 implemented in a five-stage pipeline that performs a single-cycle shift of one operand or two operands in a single cycle and provides multiplication support and 32-bit barrel-shift support. This RISC core 50 is a standard StrongARM architecture, but it is implemented with a five-stage pipeline for performance reasons. The processor core 20 also includes a 16-kilobyte instruction cache 52, an 8-kilobyte data cache 54, and a prefetch stream buffer 56. The core processor 20 performs arithmetic operations in parallel with memory writes and instruction fetches. The core processor 20 interfaces with other functional units via the ARM-defined ASB bus. The ASB bus is a 32-bit bidirectional bus 32.
Microengines:
Referring to FIG. 3, an exemplary one of the microengines 22a-22f (e.g., microengine 22f) is shown. The microengine includes a control store 70, which in one implementation is a RAM of 1,024 32-bit words. The RAM stores a microprogram. The microprogram is loadable by the core processor 20. The microengine 22f also includes controller logic 72. The controller logic includes an instruction decoder 73 and program counter (PC) units 72a-72d. The four micro program counters 72a-72d are maintained in hardware. The microengine 22f also includes context event switching logic 74. The context event logic 74 receives messages (e.g., SEQ_#_EVENT_RESPONSE, FBI_EVENT_RESPONSE, SRAM_EVENT_RESPONSE, SDRAM_EVENT_RESPONSE, and ASB_EVENT_RESPONSE) from each of the shared resources (e.g., the SDRAM controller 26a, SRAM controller 26b, or processor core 20, control and status registers, and so forth). These messages provide information on whether a requested function has completed. Based on whether the function requested by a thread has completed and signaled completion, the thread needs to wait for that completion signal, and, if the thread is enabled to operate, the thread is placed on an available-thread list (not shown). The microengine 22f can have a maximum of, e.g., four threads available.
In addition to event signals that are local to an executing thread, the microengines 22 employ signaling states that are global. With signaling states, an executing thread can broadcast a signal state to all of the microengines 22. Upon receiving a "receive request available" signal, any and all threads in the microengines can branch on these signaling states. The signaling states can be used to determine the availability of a resource or whether a resource is due for servicing.
The context event logic 74 has arbitration for the four (4) threads. In one embodiment, the arbitration is a round-robin mechanism. Other techniques could be used, including priority queuing or weighted fair queuing. The microengine 22f also includes an execution box (EBOX) data path 76 that includes an arithmetic logic unit 76a and a general-purpose register set 76b. The arithmetic logic unit 76a performs arithmetic and logical functions as well as shift functions. The register set 76b has a relatively large number of general-purpose registers. As will be described in FIG. 3C, in this implementation there are 64 general-purpose registers in a first bank, bank A, and 64 in a second bank, bank B. The general-purpose registers are windowed, as will be described, so that they are relatively and absolutely addressable.
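A minimal round-robin arbiter over the four thread contexts might look like the following sketch. This is one possible realization only; as noted above, the arbitration could equally be priority queuing or weighted fair queuing.

```python
class RoundRobinArbiter:
    """Round-robin thread arbitration: the grant rotates starting from
    the context after the last one granted, skipping non-ready threads."""

    def __init__(self, n_threads=4):
        self.n = n_threads
        self.last = self.n - 1  # so the first grant starts at thread 0

    def grant(self, ready):
        """ready: set of runnable thread ids. Returns granted id or None."""
        for i in range(1, self.n + 1):
            t = (self.last + i) % self.n  # rotate past the last grant
            if t in ready:
                self.last = t
                return t
        return None  # no thread is ready; the microengine idles
```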
The microengine 22f also includes a write transfer register stack 78 and a read transfer register stack 80. These registers are also windowed so that they are relatively and absolutely addressable. The write transfer register stack 78 is where write data to a resource is located. Similarly, the read register stack 80 is for return data from a shared resource. Subsequent to or concurrent with the arrival of data, an event signal from the respective shared resource (e.g., the SDRAM controller 26a, SRAM controller 26b, or core processor 20) is provided to the context event arbiter 74, which then alerts the thread that the data is available or has been sent. The transfer register banks 78 and 80 are connected to the execution box (EBOX) 76 through a data path. In one implementation, the read transfer register has 64 registers and the write transfer register has 64 registers.
As shown in FIG. 3A, the microengine datapath maintains a five-stage micro-pipeline 82. This pipeline includes lookup of the microinstruction word 82a, formation of the register file addresses 82b, read of operands from the register file 82c, an ALU, shift, or compare operation 82d, and write-back of the result to a register 82e. By providing a write-back data bypass into the ALU/shifter units, and by assuming the registers are implemented as a register file (rather than a RAM), the microengine can perform a simultaneous register file read and write, which completely hides the write operation.
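The five-stage flow can be modeled as a simple occupancy trace, in the same style as the latency tables later in this description. The stage names mirror the text; the model assumes no stalls between independent microwords, which is what the write-back bypass permits.

```python
# the five stages named in the text, in pipeline order
STAGES = ["lookup", "reg_addr", "reg_read", "alu_shift", "write_back"]

def pipeline_trace(microwords):
    """Return {stage: occupant-per-cycle list} for an in-order,
    stall-free pipeline: each microword advances one stage per cycle."""
    n_cycles = len(microwords) + len(STAGES) - 1
    trace = {s: [None] * n_cycles for s in STAGES}
    for i, mw in enumerate(microwords):
        for s, stage in enumerate(STAGES):
            trace[stage][i + s] = mw  # microword i occupies stage s at cycle i+s
    return trace
```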
The SDRAM interface 26a returns a signal to the requesting microengine on reads, indicating whether a parity error occurred on the read request. The microengine microcode is responsible for checking the SDRAM read parity flag when the microengine uses any returned data. Upon checking the flag, if it was set, the act of branching on it clears it. The parity flag is only sent when the SDRAM is enabled for checking and the SDRAM is parity protected. The microengines and the PCI unit are the only requestors notified of parity errors. Therefore, if the processor core 20 or a FIFO requires parity protection, a microengine assists in the request. The microengines 22a-22f support conditional branches. The worst-case conditional branch latency (not including jumps) occurs when the branch decision is a result of condition codes set by the preceding micro-control instruction. Table 1 below shows the latency:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
Micro-store lookup          | n1 | cb | n2 | XX | b1 | b2 | b3 | b4 |
Register address generation |    | n1 | cb | XX | XX | b1 | b2 | b3 |
Register file lookup        |    |    | n1 | cb | XX | XX | b1 | b2 |
ALU/shifter/cc              |    |    |    | n1 | cb | XX | XX | b1 |
Write back                  |    |    |    |    | n1 | cb | XX | XX |

where nx = a pre-branch microword (n1 sets the cc's)
      cb = the conditional branch
      bx = a post-branch microword
      XX = an aborted microword
As shown in Table 1, it is not until cycle 4 that the condition codes of n1 are set, and a branch decision can be made (which in this case causes the branch path to be looked up in cycle 5). The microengine incurs a 2-cycle branch latency penalty because it must abort operations n2 and n3 (the 2 microwords directly after the branch) in the pipe before the branch path begins to fill the pipe with operation b1. If the branch is not taken, no microwords are aborted and execution continues normally. The microengines have several mechanisms to reduce or eliminate the effective branch latency.
The microengines support deferred branches, i.e., situations where the microengine allows 1 or 2 microwords after the branch to occur before the branch takes effect (the effect of the branch is "deferred" in time). Thus, if useful work can be found to fill the wasted cycles after the branch microword, the branch latency can be hidden. A 1-cycle deferred branch is shown below, where n2 is allowed to execute after cb but before b1:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
Micro-store lookup          | n1 | cb | n2 | XX | b1 | b2 | b3 | b4 |
Register address generation |    | n1 | cb | n2 | XX | b1 | b2 | b3 |
Register file lookup        |    |    | n1 | cb | n2 | XX | b1 | b2 |
ALU/shifter/cc              |    |    |    | n1 | cb | n2 | XX | b1 |
Write back                  |    |    |    |    | n1 | cb | n2 | XX |
The following illustrates a 2-cycle deferred branch, where n2 and n3 are both allowed to complete before the branch to b1 occurs. Note that a 2-cycle branch deferment is only allowed when the condition codes are set on the microword preceding the branch.
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 |
----------------------------+----+----+----+----+----+----+----+----+----+
Micro-store lookup          | n1 | cb | n2 | n3 | b1 | b2 | b3 | b4 | b5 |
Register address generation |    | n1 | cb | n2 | n3 | b1 | b2 | b3 | b4 |
Register file lookup        |    |    | n1 | cb | n2 | n3 | b1 | b2 | b3 |
ALU/shifter/cc              |    |    |    | n1 | cb | n2 | n3 | b1 | b2 |
Write back                  |    |    |    |    | n1 | cb | n2 | n3 | b1 |
The microengines also support condition code evaluation. If the condition codes on which a branch decision is made are set 2 or more microwords before the branch, 1 cycle of branch latency can be eliminated, because the branch decision can be made 1 cycle earlier:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
Micro-store lookup          | n1 | n2 | cb | XX | b1 | b2 | b3 | b4 |
Register address generation |    | n1 | n2 | cb | XX | b1 | b2 | b3 |
Register file lookup        |    |    | n1 | n2 | cb | XX | b1 | b2 |
ALU/shifter/cc              |    |    |    | n1 | n2 | cb | XX | b1 |
Write back                  |    |    |    |    | n1 | n2 | cb | XX |
In this example, n1 sets the condition codes and n2 does not, so the branch decision can be made at cycle 4 (rather than 5), eliminating 1 cycle of branch latency. In the example below, the 1-cycle branch deferment and early setting of condition codes are combined to completely hide the branch latency:
Condition codes (cc's) set 2 cycles before a 1-cycle deferred branch:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
Micro-store lookup          | n1 | n2 | cb | n3 | b1 | b2 | b3 | b4 |
Register address generation |    | n1 | n2 | cb | n3 | b1 | b2 | b3 |
Register file lookup        |    |    | n1 | n2 | cb | n3 | b1 | b2 |
ALU/shifter/cc              |    |    |    | n1 | n2 | cb | n3 | b1 |
Write back                  |    |    |    |    | n1 | n2 | cb | n3 |
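The branch-latency cases above can be summarized in one small bookkeeping function. It is a simplified model of the rules stated in the text (a 2-cycle penalty when the condition codes are set by the microword immediately before the branch, 1 cycle when they are set one microword earlier, and each deferred slot hiding one cycle of the penalty), not an exhaustive account of the hardware.

```python
def branch_penalty(cc_set_distance, deferred_slots=0):
    """Aborted cycles for a taken conditional branch.

    cc_set_distance: how many microwords before the branch set the
    condition codes (1 = immediately before, the Table 1 worst case;
    2 = one microword earlier; 3+ = early enough for no penalty).
    deferred_slots: deferred-branch microwords doing useful work."""
    if cc_set_distance == 1:
        base = 2  # decision in the branch's ALU stage: 2 aborted fetches
    elif cc_set_distance == 2:
        base = 1  # decision made 1 cycle earlier: 1 aborted fetch
    else:
        base = 0  # cc known early enough; no exposed latency
    return max(base - deferred_slots, 0)
```

The cases in the tables check out: the Table 1 worst case costs 2 cycles, a 1-cycle deferred slot reduces it to 1, and early condition codes combined with one deferred slot hide the latency completely.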
In the case where the condition codes cannot be set early (i.e., they are set in the microword preceding the branch), the microengine supports branch guessing in an attempt to reduce the 1 cycle of exposed branch latency that remains. By "guessing" either the branch path or the sequential path, the micro-sequencer prefetches the guessed path 1 cycle before it definitely knows which path to execute. If it guesses correctly, 1 cycle of branch latency is eliminated, as shown below:
Guess branch taken / branch is taken:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
Micro-store lookup          | n1 | cb | n1 | b1 | b2 | b3 | b4 | b5 |
Register address generation |    | n1 | cb | XX | b1 | b2 | b3 | b4 |
Register file lookup        |    |    | n1 | cb | XX | b1 | b2 | b3 |
ALU/shifter/cc              |    |    |    | n1 | cb | XX | b1 | b2 |
Write back                  |    |    |    |    | n1 | cb | XX | b1 |
If the microcode guesses a branch taken incorrectly, the microengine still only wastes 1 cycle:
guess branch taken / branch is NOT taken:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | cb | n1 | XX | n2 | n3 | n4 | n5 |
register address generation |    | n1 | cb | n1 | XX | n2 | n3 | n4 |
register file lookup        |    |    | n1 | cb | n1 | XX | n2 | n3 |
ALU/shifter/cc              |    |    |    | n1 | cb | n1 | XX | n2 |
write back                  |    |    |    |    | n1 | cb | n1 | XX |
However, the latency penalty is distributed differently when microcode guesses that a branch is not taken.
For guess branch NOT taken / branch is NOT taken, there are no wasted cycles:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | cb | n1 | n2 | n3 | n4 | n5 | n6 |
register address generation |    | n1 | cb | n1 | n2 | n3 | n4 | n5 |
register file lookup        |    |    | n1 | cb | n1 | n2 | n3 | n4 |
ALU/shifter/cc              |    |    |    | n1 | cb | n1 | n2 | n3 |
write back                  |    |    |    |    | n1 | cb | n1 | n2 |
However, for guess branch NOT taken / branch IS taken, there are 2 wasted cycles:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | cb | n1 | XX | b1 | b2 | b3 | b4 |
register address generation |    | n1 | cb | XX | XX | b1 | b2 | b3 |
register file lookup        |    |    | n1 | cb | XX | XX | b1 | b2 |
ALU/shifter/cc              |    |    |    | n1 | cb | XX | XX | b1 |
write back                  |    |    |    |    | n1 | cb | XX | XX |
The microengine can combine branch guessing with 1-cycle branch deferment to further improve the result. For guess branch taken with 1-cycle deferred branch / branch is taken:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
----------------------------+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | cb | n2 | b1 | b2 | b3 | b4 | b5 |
register address generation |    | n1 | cb | n2 | b1 | b2 | b3 | b4 |
register file lookup        |    |    | n1 | cb | n2 | b1 | b2 | b3 |
ALU/shifter/cc              |    |    |    | n1 | cb | n2 | b1 | b2 |
write back                  |    |    |    |    | n1 | cb | n2 | b1 |
In the case above, the 2 cycles of branch latency are hidden by the execution of n2, because the branch direction was guessed correctly. If the microcode guesses incorrectly, 1 cycle of branch latency remains exposed, as shown below:
guess branch taken with 1-cycle deferred branch / branch is NOT taken:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 |
----------------------------+----+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | cb | n2 | XX | n3 | n4 | n5 | n6 | n7 |
register address generation |    | n1 | cb | n2 | XX | n3 | n4 | n5 | n6 |
register file lookup        |    |    | n1 | cb | n2 | XX | n3 | n4 | n5 |
ALU/shifter/cc              |    |    |    | n1 | cb | n2 | XX | n3 | n4 |
write back                  |    |    |    |    | n1 | cb | n2 | XX | n3 |
If the microcode correctly guesses a branch NOT taken, the pipeline flows sequentially in the normal, undisturbed case. If the microcode incorrectly guesses a branch NOT taken, the microengine again exposes 1 cycle of unproductive execution, as shown below:
guess branch NOT taken / branch IS taken:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 |
----------------------------+----+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | cb | n2 | XX | b1 | b2 | b3 | b4 | b5 |
register address generation |    | n1 | cb | n2 | XX | b1 | b2 | b3 | b4 |
register file lookup        |    |    | n1 | cb | n2 | XX | b1 | b2 | b3 |
ALU/shifter/cc              |    |    |    | n1 | cb | n2 | XX | b1 | b2 |
write back                  |    |    |    |    | n1 | cb | n2 | XX | b1 |
where: nx is a pre-branch microword (n1 sets the cc's)
cb is a conditional branch
bx is a post-branch microword
XX is an aborted microword
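The wasted-cycle counts implied by the tables above can be summarized in a small lookup. The sketch below is illustrative only (the function and table names are invented, not from the source); it encodes the penalties the text describes for each guess/outcome combination, with and without a 1-cycle deferred branch slot filled with useful work.

```python
# Aborted-cycle penalties for one conditional branch under branch guessing,
# as described in the tables above. Names and structure are illustrative.
PENALTY = {
    # (guessed_taken, actually_taken): wasted cycles
    (True, True): 0,    # correct "taken" guess: latency eliminated
    (True, False): 1,   # wrong "taken" guess: 1 wasted cycle
    (False, False): 0,  # correct "not taken" guess: no waste
    (False, True): 2,   # wrong "not taken" guess: 2 wasted cycles
}

def wasted_cycles(guessed_taken, actually_taken, deferred=False):
    """Aborted pipeline cycles for a conditional branch.

    With a 1-cycle deferred branch slot filled by useful work, a wrong
    guess in either direction exposes only 1 cycle, while a correct
    guess hides the branch latency entirely.
    """
    if deferred:
        return 0 if guessed_taken == actually_taken else 1
    return PENALTY[(guessed_taken, actually_taken)]
```

For example, a wrong "not taken" guess without deferment costs 2 cycles, matching the table above.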
In the case of a jump instruction, 3 extra cycles of latency are incurred, because the branch address is not known until the end of the cycle in which the jump is in the ALU stage:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 |
----------------------------+----+----+----+----+----+----+----+----+----+
microstore lookup           | n1 | jp | XX | XX | XX | j1 | j2 | j3 | j4 |
register address generation |    | n1 | jp | XX | XX | XX | j1 | j2 | j3 |
register file lookup        |    |    | n1 | jp | XX | XX | XX | j1 | j2 |
ALU/shifter/cc              |    |    |    | n1 | jp | XX | XX | XX | j1 |
write back                  |    |    |    |    | n1 | jp | XX | XX | XX |
Context switch:
Referring to FIG. 3B, a format of a context switch instruction is shown. A context switch is a special form of branch that causes a different context (and its associated PC) to be selected. Context switching also introduces some branch latency. Consider the following context switch:
                            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 |
----------------------------+----+----+----+----+----+----+----+----+----+
microstore lookup           | o1 | ca | br | n1 | n2 | n3 | n4 | n5 | n6 |
register address generation |    | o1 | ca | XX | n1 | n2 | n3 | n4 | n5 |
register file lookup        |    |    | o1 | ca | XX | n1 | n2 | n3 | n4 |
ALU/shifter/cc              |    |    |    | o1 | ca | XX | n1 | n2 | n3 |
write back                  |    |    |    |    | o1 | ca | XX | n1 | n2 |
where: ox is an old context flow microword
br is a branch microword in the old context
ca is a context rearbitration (which causes the context switch)
nx is a new context flow microword
XX is an aborted microword
In a context switch the "br" microword is aborted to avoid control and timing complexities that could be caused by saving the correct old context PC.
Conditional branches that operate on ALU condition codes set on the microword before the branch can select a 0-, 1-, or 2-cycle branch deferment mode. Conditional branches that operate on condition codes set 2 or more microwords before the branch can select a 0- or 1-cycle branch deferment mode. All other branches (including context rearbitrations) can select a 0- or 1-cycle branch deferment mode. The architecture could be designed to make a context rearbitration microword within the branch deferment window of a preceding branch, jump, or context rearbitration microword an illegal option. That is, in some embodiments, a context switch would not be allowed to occur during a branch transition in the pipeline because, as noted, it could unduly complicate saving of the old context PC. The architecture could also be designed to make a branch, jump, or context rearbitration microword within the branch deferment window of a preceding branch illegal, to avoid complicated and possibly unpredictable branch behaviors.
Each microengine 22a-22f supports multithreaded execution of four contexts. One reason for this is to allow one thread to start executing just after another thread issues a memory reference and must wait until that reference completes before doing more work. This behavior is critical to maintaining efficient hardware execution of the microengines, because memory latency is significant. Stated differently, if only single-thread execution were supported, the microengine would sit idle for many cycles waiting for references to return, thereby reducing overall computational throughput. Multithreaded execution allows a microengine to hide memory latency by performing useful, independent work across several threads. Two synchronization mechanisms are provided to allow a thread to issue an SRAM or SDRAM reference and then subsequently synchronize to the point in time at which that reference completes.
One mechanism is "immediate synchronization". In immediate synchronization, the microengine issues the reference and immediately swaps out its context. The context will be signaled when the corresponding reference completes. Once signaled, the context is swapped back in for execution when a context-swap event occurs and it is its turn to run. Thus, from the standpoint of a single context's instruction flow, the microword after issuing the memory reference does not execute until the reference completes.
A second mechanism is "delayed synchronization". In delayed synchronization, the microengine issues the reference and then continues to perform some other useful work unrelated to the reference. Some time later, it becomes necessary to synchronize the thread's execution flow with the completion of the issued reference before further work is performed. At this point a synchronizing microword is executed that either swaps out the current thread, swapping it back in sometime later when the reference has completed, or continues executing the current thread because the reference has already completed. Delayed synchronization is implemented using two different signaling schemes.
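The two mechanisms can be contrasted with a toy model. The sketch below is an illustration with invented names (nothing here is from the source): a reference carries a completion flag, immediate synchronization swaps out right away, and delayed synchronization first performs unrelated work and only swaps out if the reference is still outstanding.

```python
# Toy model of immediate vs. delayed synchronization; all names invented.
class MemoryReference:
    def __init__(self):
        self.done = False      # set when the controller services the reference

    def complete(self):        # simulates the completion signal
        self.done = True

def immediate_sync(issue):
    """Issue a reference and swap out the context until it completes."""
    ref = issue()
    swapped_out = not ref.done         # context swaps out right away
    return ref, swapped_out

def delayed_sync(issue, unrelated_work):
    """Issue a reference, do unrelated work, then synchronize."""
    ref = issue()
    results = [work() for work in unrelated_work]  # useful work hides latency
    must_swap = not ref.done           # swap out only if still outstanding
    return results, must_swap
```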
If the memory reference is associated with a transfer register, the signal from which the thread is triggered is generated when the corresponding transfer register valid bit is set or cleared. For example, an SRAM read which deposits data into transfer register A would be signaled when the valid bit for A is set. If the memory reference is associated with the transmit FIFO or the receive FIFO rather than a transfer register, then the signal is generated when the reference completes in the SDRAM controller 26a. Only one signal state per context is held in the microengine scheduler, so only one outstanding signal can exist in this scheme.
There are at least two general operational paradigms from which microcontroller microprograms can be designed. One paradigm is to optimize overall microcontroller computational throughput and overall memory bandwidth at the expense of single-thread execution latency. This paradigm can make sense when the system has multiple microengines each executing multiple threads on unrelated data packets.
The second operational paradigm is to optimize microengine execution latency at the expense of overall microengine computational throughput and overall memory bandwidth. This paradigm can involve execution of a thread with a real-time constraint, i.e., a constraint dictating that some work must absolutely be done by some specified time. Such a constraint requires that optimization of single-thread execution be given priority over considerations of memory bandwidth or overall computational throughput. A real-time thread would imply a single microengine that executes only one thread. Multiple threads would not be handled because the goal is to allow the single real-time thread to execute as quickly as possible, and executing multiple threads would hinder that ability.
The coding styles of these two paradigms can differ significantly with respect to issuing memory references and context switching. In the real-time case, the goal is to issue as many memory references as soon as possible in order to minimize the memory latency incurred by those references. Having issued as many references as early as possible, the goal is then to have the microengine perform as many computations as possible in parallel with the references. The computation flow corresponding to real-time optimization is:
o) issuing a memory reference 1
o) issuing a memory reference 2
o) issuing a memory reference 3
o) performing work independent of memory references 1, 2, and 3
o) completion synchronization with memory reference 1
o) performing work related to memory reference 1 and unrelated to memory references 2 and 3
o) issuing any new memory references based on previous work
o) completion synchronization with memory reference 2
o) performing work related to memory references 1 and 2 and unrelated to memory reference 3
o) issuing any new memory references based on previous work
o) completion synchronization with memory reference 3
o) performing work related to the completion of all 3 references
o) issue any new memory references according to the previous work.
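The ordering listed above can be sketched as a step list whose key property is that all three references are issued before the first synchronization. The sketch below uses invented names and omits the follow-on "issue any new memory references" steps for brevity.

```python
# Sketch of the real-time-optimized ordering listed above; names invented.
def realtime_schedule():
    """Ordered steps: issue all references first, then sync in order."""
    steps = [("issue", i) for i in (1, 2, 3)]
    steps.append(("work", "independent"))   # work unrelated to refs 1-3
    for i in (1, 2, 3):
        steps.append(("sync", i))           # completion synchronization
        steps.append(("work", i))           # work dependent on ref i
    return steps

def issues_precede_syncs(steps):
    """Check the key property of the real-time coding style."""
    first_sync = min(i for i, s in enumerate(steps) if s[0] == "sync")
    last_issue = max(i for i, s in enumerate(steps) if s[0] == "issue")
    return last_issue < first_sync
```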
By contrast, optimizing for throughput and bandwidth takes a different approach. With optimization of microengine computational throughput and overall memory bandwidth, less consideration is given to single-thread execution latency. To accomplish this, the goal is to space memory references equally across the microprogram for each thread. This provides a uniform stream of memory references to the SRAM and SDRAM controllers and maximizes the probability that 1 thread is always available to hide the memory latency incurred when another thread is swapped out.
Register file address type:
referring to fig. 3C, there are two register address spaces that are locally accessible registers and globally accessible registers that are accessible by all microengines. "general purpose registers" (GPRs) are implemented as two separate banks (bank a and bank B) whose addresses are interleaved on a word-by-word basis so that the bank a registers have 1 sb-0 and the bank B registers have 1 sb-1. Each bank is capable of performing simultaneous reading and writing of two different words within its bank.
Across banks A and B, the register set 76b is also organized into four windows 76b0-76b3 of 32 registers that are relatively addressable per thread. Thus, thread_0 will find its register 0 at 77a (register 0), thread_1 will find its register_0 at 77b (register 32), thread_2 will find its register_0 at 77c (register 64), and thread_3 will find its register_0 at 77d (register 96). Relative addressing is supported so that multiple threads can use exactly the same control store and locations but access different windows of registers and perform different functions. The use of register window addressing and bank addressing provides the necessary read bandwidth using only dual-ported RAMs in the microengine 22f.
These windowed registers do not have to save data from context to context, so the normal push/pop of a context swap file or stack is eliminated. Context switching here has a 0-cycle overhead for changing from one context to another. Relative register addressing divides the register banks into windows across the address width of the general purpose register set. Relative addressing allows access to any of the windows relative to the starting point of the window. Absolute addressing is also supported in this architecture, where any one of the absolute registers may be accessed by any thread by providing the exact address of the register.
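The window layout described above maps each thread's relative register 0 to absolute registers 0, 32, 64, and 96. A minimal sketch of that mapping, assuming the fixed 32-register-per-thread windows the text states (function name invented):

```python
WINDOW_SIZE = 32  # 32 relatively addressable registers per thread (per text)

def absolute_register(thread_id, relative_index):
    """Map a thread-relative register number to its absolute location."""
    if not 0 <= relative_index < WINDOW_SIZE:
        raise ValueError("register outside the thread's window")
    if not 0 <= thread_id <= 3:
        raise ValueError("four contexts per microengine")
    return thread_id * WINDOW_SIZE + relative_index
```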
Addressing of general purpose registers 78 occurs in 2 modes depending on the microword format: absolute and relative. In absolute mode, addressing of a register address is specified directly in the 7-bit source field (a6-a0 or b6-b0):
            7    6    5    4    3    2    1    0
          +----+----+----+----+----+----+----+----+
A GPR:    | a6 | 0  | a5 | a4 | a3 | a2 | a1 | a0 |  a6=0
B GPR:    | b6 | 1  | b5 | b4 | b3 | b2 | b1 | b0 |  b6=0
SRAM/ASB: | a6 | a5 | a4 | 0  | a3 | a2 | a1 | a0 |  a6=1, a5=0, a4=0
SDRAM:    | a6 | a5 | a4 | 0  | a3 | a2 | a1 | a0 |  a6=1, a5=0, a4=1
or the register address is specified directly in the 8-bit dest field (d7-d0):
7 6 5 4 3 2 1 0
+----+----+----+----+----+----+----+----+
A GPR: | d7 | d6 | d5 | d4 | d3 | d2 | d1 | d0 | d7=0,d6=0
B GPR: | d7 | d6 | d5 | d4 | d3 | d2 | d1 | d0 | d7=0,d6=1
SRAM/ASB:| d7 | d6 | d5 | d4 | d3 | d2 | d1 | d0 | d7=1,d6=0,d5=0
SDRAM: | d7 | d6 | d5 | d4 | d3 | d2 | d1 | d0 | d7=1,d6=0,d5=1
If <a6:a5>=1,1, <b6:b5>=1,1, or <d7:d6>=1,1, the lower bits are interpreted as a context-relative address field (described below). When a non-relative A or B source address is specified in the A or B absolute field, only the lower half of the SRAM/ASB and SDRAM address spaces can be addressed; effectively, reading absolute SRAM/SDRAM devices has an effective address space. However, since this restriction does not apply to the dest field, writing the SRAM/SDRAM still uses the full address space.
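The constraint bits in the absolute-mode table above can be read as a small decoder over the upper bits of the source field. The sketch below is illustrative only (function name invented): it classifies a 7-bit source field by a6, a5, and a4 as shown above, treating <a6:a5>=1,1 as the context-relative case.

```python
def classify_absolute_source(field):
    """Classify a 7-bit absolute source field (bits a6..a0) by its
    constraint bits, per the absolute-mode table above."""
    a6 = (field >> 6) & 1
    a5 = (field >> 5) & 1
    a4 = (field >> 4) & 1
    if a6 == 0:
        return "GPR"                  # A or B bank, selected elsewhere
    if a5 == 1:
        return "context-relative"     # <a6:a5> = 1,1
    return "SRAM/ASB" if a4 == 0 else "SDRAM"
```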
In relative mode, the specified address is offset within the context space, as defined by a 5-bit source field (a4-a0 or b4-b0):
            7    6    5    4    3    2    1    0
          +----+----+----+----+----+----+----+----+
A GPR:    | a4 | 0  |  context  | a3 | a2 | a1 | a0 |  a4=0
B GPR:    | b4 | 1  |  context  | b3 | b2 | b1 | b0 |  b4=0
SRAM/ASB: | ab4 | 0  | ab3 |  context  | ab2 | ab1 | ab0 |  ab4=1, ab3=0
SDRAM:    | ab4 | 0  | ab3 |  context  | ab2 | ab1 | ab0 |  ab4=1, ab3=1
or, as defined by the 6-bit dest field (d5-d0):
            7    6    5    4    3    2    1    0
          +----+----+----+----+----+----+----+----+
A GPR:    | d5 | d4 |  context  | d3 | d2 | d1 | d0 |  d5=0, d4=0
B GPR:    | d5 | d4 |  context  | d3 | d2 | d1 | d0 |  d5=0, d4=1
SRAM/ASB: | d5 | d4 | d3 |  context  | d2 | d1 | d0 |  d5=1, d4=0, d3=0
SDRAM:    | d5 | d4 | d3 |  context  | d2 | d1 | d0 |  d5=1, d4=0, d3=1
If <d5:d4>=1,1, then the destination address does not address a valid register, and thus the dest operand is not written back.
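The dest-field constraints can likewise be read as a decoder, where <d5:d4>=1,1 addresses no valid register and suppresses write-back. An illustrative sketch (function name invented):

```python
def classify_relative_dest(field):
    """Classify a 6-bit relative dest field (bits d5..d0) per the table
    above; returns None when the dest operand is not written back."""
    d5 = (field >> 5) & 1
    d4 = (field >> 4) & 1
    d3 = (field >> 3) & 1
    if d5 == 1 and d4 == 1:
        return None                   # invalid target: no write-back
    if d5 == 0:
        return "A GPR" if d4 == 0 else "B GPR"
    return "SRAM/ASB" if d3 == 0 else "SDRAM"
```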
The following registers are globally accessible from the microengines and the memory controller:
hash unit register
Scratch pad and general register
Reception FIFO and reception status FIFO
Transmit FIFO
Transmit control FIFO
The microengines are not driven by interrupts. Each micro-process is executed until completion and then a new process is selected based on the status signaled by the other devices in processor 12.
Referring to FIG. 4, the SDRAM memory controller 26a includes a memory reference queue 90 where memory reference requests arrive from the various microengines 22a-22f. The memory controller 26a includes an arbiter 91 that selects the next microengine reference request to go to any of the functional units. Given that one of the microengines is providing a reference request, the reference request comes into the SDRAM controller 26a through the address and command queue 90. If the reference request has a bit set called the "optimized MEM bit", the incoming reference request is sorted into either the even bank queue 90a or the odd bank queue 90b. If the memory reference request does not have the memory optimization bit set, it goes by default into the order queue 90c. The SDRAM controller 26 is a resource shared among the FBUS interface 28, the core processor 20, and the PCI interface 24. The SDRAM controller 26 also maintains a state machine for performing "read-modify-write" atomic operations. The SDRAM controller 26 also performs byte alignment for requests for data from the SDRAM.
The order queue 90c maintains the order of reference requests from the microengines. With a series of odd and even bank references, it may be required that a signal is returned only upon completion of the entire sequence of memory references to both the odd and even banks. If the microengine 22f sorts its memory references into odd bank and even bank references, and one of the banks (e.g., the even bank) is drained of memory references before the odd bank, but the signal is asserted on the last even reference, the memory controller 26a could conceivably signal back to the microengine that the memory request had completed even though the odd bank reference had not been serviced. This occurrence could cause a coherency problem. It is avoided by providing the order queue 90c, which allows a microengine to have multiple memory references outstanding, of which only its last memory reference needs to signal completion.
The SDRAM controller 26a also includes a high priority queue 90d. In the high priority queue 90d, an incoming memory reference from one of the microengines goes directly to the high priority queue and is operated on at a higher priority than the other memory references in the other queues. All of these queues (the even bank queue 90a, the odd bank queue 90b, the order queue 90c, and the high priority queue 90d) are implemented in a single RAM structure that is logically segmented into four different windows, each window having its own head and tail pointer. Since filling and draining operations are only a single input and a single output, they can be placed in the same RAM structure to increase the density of RAM structures.
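The single-RAM, four-window queue organization described above can be modeled as fixed segments of one array, each managed by its own head and tail pointer. The sketch below is illustrative only; the class name, window depth, and methods are invented, not from the source.

```python
class SegmentedQueueRAM:
    """One RAM logically split into fixed windows, each a ring queue.
    Sizes and names are illustrative, not from the source."""
    def __init__(self, windows=4, depth=8):
        self.depth = depth
        self.ram = [None] * (windows * depth)   # the single RAM structure
        self.head = [0] * windows               # per-window head pointer
        self.tail = [0] * windows               # per-window tail pointer
        self.count = [0] * windows

    def enqueue(self, w, ref):
        if self.count[w] == self.depth:
            raise OverflowError("queue window full")
        self.ram[w * self.depth + self.tail[w]] = ref
        self.tail[w] = (self.tail[w] + 1) % self.depth
        self.count[w] += 1

    def dequeue(self, w):
        if self.count[w] == 0:
            raise IndexError("queue window empty")
        ref = self.ram[w * self.depth + self.head[w]]
        self.head[w] = (self.head[w] + 1) % self.depth
        self.count[w] -= 1
        return ref

    def fullness(self, w):
        """Per-window occupancy, as an arbiter would examine it."""
        return self.count[w]
```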
The SDRAM controller 26a also includes core bus interface logic (i.e., ASB bus 92). The ASB bus interface logic 92 interfaces the core processor 20 to the SDRAM controller 26a. The ASB bus is a bus that includes a 32-bit data path and a 28-bit address path. Data is accessed to and from memory through the MEM ASB data device 98 (e.g., a buffer). The MEM ASB data device 98 is a queue for write data. If there is incoming data from the core processor 20 via the ASB interface 92, the data can be stored in the MEM ASB device 98 and subsequently moved from the MEM ASB device 98 through the SDRAM interface 110 to the SDRAM memory 16a. Although not shown, the same queue structure can be provided for reads. The SDRAM controller 26a also includes an engine 97 to pull data from the microengines and the PCI bus.
The additional queues include a PCI address queue 94 and an ASB read/write queue 96 that hold a number of requests. The memory requests are sent to the SDRAM interface 110 via multiplexer 106. The multiplexer 106 is controlled by the SDRAM arbiter 91 which detects the fullness of each queue and the status of the respective requests and determines therefrom the priority according to a programmable value stored in the priority service control register 100.
Once control to the multiplexer 106 selects a memory reference request, the memory reference request is sent to a decoder 108 where it is decoded and an address is generated. The decoded address is sent to the SDRAM interface 110, where it is decomposed into row and column address strobes to access the SDRAM 16a and to write or read data over data lines 16a (sending data to bus 112). In one implementation, the bus 112 is actually two separate buses rather than a single bus: a read bus coupling the distributed microengines 22a-22f and a write bus coupling the distributed microengines 22a-22f.
One feature of the SDRAM controller 26a is that, when a memory reference is stored in the queues 90, in addition to the optimized MEM bit that can be set, there is a "chaining bit" that can also be set. The chaining bit, when set, allows for special handling of contiguous memory references. As previously mentioned, the arbiter 12 controls which microengine will be selected to provide memory reference requests over the command bus to the queues 90 (FIG. 4). Assertion of the chain bit controls the arbiter to have it select the functional unit that previously requested that bus, because setting of the chain bit indicates that the microengine has issued a chain request.
When the chain bit is set, contiguous memory references will be received in the queues 90. Those contiguous references will typically be stored in the order queue 90c, because the contiguous memory references are multiple memory references from a single thread. In order to provide synchronization, the memory controller 26a need only signal at the end of the chained memory references ("done"). However, in an optimized memory chain (e.g., when the optimized MEM bit and the chain bit are both set), the memory references could go into different banks and potentially complete on one of the banks, issuing the "done" signal before the other bank was fully drained, thus destroying coherency. Therefore, the chain bit is used by the controller 110 to maintain the memory references from the current queue.
Referring to FIG. 4A, a flow representation of the arbitration policy in the SDRAM controller 26a is shown. The arbitration policy favors chained microengine memory requests. The process 115 starts by examining for chained microengine memory reference requests at 115a. The process 115 stays at the chained requests until the chain bit is cleared. The process then examines ASB bus requests 115b, followed by PCI bus requests 115c, high priority queue service 115d, opposite bank requests 115e, order queue requests 115f, and same bank requests 115g. Chained requests are serviced to completion, whereas services 115b-115d are serviced in round-robin order. Only when services 115a-115d are fully drained does the process handle services 115e-115g. A chained microengine memory reference request exists when the previous SDRAM memory request has its chain bit set. When the chain bit is set, the arbitration engine simply services the same queue again until the chain bit is cleared. The ASB has higher priority than PCI due to the severe performance penalty imposed on the StrongARM core when the ASB is in a wait state. PCI has higher priority than the microengines due to its latency requirements. With other buses, however, the arbitration priority could be different.
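The priority walk just described can be sketched as a simple selector. This is an illustrative simplification with invented names: the round-robin rotation among the middle services is reduced to a fixed scan, and a set chain bit simply locks the arbiter onto the last-serviced queue.

```python
# Simplified priority order from the arbitration policy above; the
# round-robin rotation among asb/pci/high_priority is reduced to a
# fixed scan, and all names are illustrative.
PRIORITY = ["chained", "asb", "pci", "high_priority",
            "opposite_bank", "order", "same_bank"]

def next_service(pending, chain_active, last_queue=None):
    """Pick the next queue to service.

    While a chain bit is set, the arbiter keeps servicing the same
    queue until the chain is cleared.
    """
    if chain_active and last_queue is not None:
        return last_queue
    for name in PRIORITY:
        if pending.get(name):
            return name
    return None
```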
Referring to FIG. 4B, typical memory timing without active memory optimization and with active memory optimization is shown. As can be seen, the use of active memory optimization maximizes the use of the bus and thus hides the inherent latency within physical SDRAM devices. In this example, a non-optimized access can take 14 cycles while an optimized access can take 7 cycles.
Referring to FIG. 5, the memory controller 26b for the SRAM is shown. The memory controller 26b includes an address and command queue 120. While the memory controller 26a (FIG. 4) has a queue for memory optimization based on odd and even banking, memory controller 26b is optimized based on the type of memory operation, i.e., a read or a write. The address and command queue 120 includes a high priority queue 120a, a read queue 120b, which is the predominant memory reference function that the SRAM performs, and an order queue 120c, which in general will include all writes to SRAM and reads that are to be non-optimized. Although not shown, the address and command queue 120 could also include a write queue.
The SRAM controller 26b also includes core bus interface logic (i.e., ASB bus 122). The ASB bus interface logic 122 interfaces the core processor 20 to the SRAM controller 26b. The ASB bus includes a 32-bit data path and a 28-bit address path. Data is accessed to and from memory through the MEM ASB data device 128 (e.g., a buffer). The MEM ASB data device 128 is a queue for write data. If there is incoming data from the core processor 20 via the ASB interface 122, the data can be stored in the MEM ASB device 128 and subsequently moved from the MEM ASB device 128 through the SRAM interface 140 to the SRAM memory 16b. Although not shown, the same queue structure can be provided for reads. The SRAM controller 26b also includes an engine 127 to pull data from the microengines and the PCI bus.
The memory requests are sent to the SRAM interface 140 via a multiplexer 126. The multiplexer 126 is controlled by an SRAM arbiter 131, which detects the fullness of each of the queues and the status of the requests and from that decides priority based on a programmable value stored in a priority service control register 130. Once control to the multiplexer 126 selects a memory reference request, the memory reference request is sent to a decoder 138 where it is decoded and an address is generated. The SRAM unit maintains control of the memory-mapped off-chip SRAM and the expansion ROM. The SRAM controller 26b can address, e.g., 16 megabytes, with, e.g., 8 megabytes mapped for SRAM 16b and 8 megabytes reserved for special functions including: boot space via flash ROM 16c; console port access for MAC devices 13a, 13b; and access to associated (RMON) counters. The SRAM is used for local lookup tables and queue management functions.
The SRAM controller 26b supports the following transactions:
microengine requests (via dedicated bus) to/from SRAM
Core processor (via ASB bus) to/from SRAM
The SRAM controller 26b performs memory reference sorting to minimize delays (bubbles) in the pipeline from the SRAM interface 140 to the memory 16b. The SRAM controller 26b does memory reference sorting based on the read function. A bubble can be either 1 or 2 cycles, depending on the type of memory device employed.
The SRAM controller 26b includes a lock lookup device 142, which is an eight (8)-entry address content addressable memory (CAM) for look-ups of read locks. Each location includes a valid bit that is examined by subsequent read lock requests. The address and command queue 120 also includes a read lock miss queue 120d. The read lock miss queue 120d is used to hold read memory reference requests that fail because of a lock existing on a portion of memory. That is, one of the microengines issues a memory request that has a read lock request, which is processed in the address and control queue 120. The memory request will operate on either the order queue 120c or the read queue 120b, and will recognize it as a read lock request. The controller 26b will access the lock lookup device 142 to determine whether this memory location is already locked. If this memory location is locked from any prior read lock request, then this memory lock request will fail and will be stored in the read lock miss queue 120d. If it is unlocked, or if the lock lookup device 142 shows no lock on that address, then the address of that memory reference will be used by the SRAM interface 140 to perform a traditional SRAM address read/write request to the memory 16b. The command controller and address generator 138 will also enter the lock into the lock lookup device 142 so that subsequent read lock requests will find the memory location locked. A memory location is unlocked, after the need for the lock has ended, by operation of a microcontrol instruction in a program. The location is unlocked by clearing the valid bit in the CAM. After an unlock, the read lock miss queue 120d becomes the highest priority queue, giving all queued read lock misses a chance to issue a memory lock request.
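The lock lookup flow above can be modeled with a small sketch: an 8-entry lock table, a miss queue for read locks that find the address already locked, and an unlock path that replays the queued misses. All interfaces below are invented for illustration; they are not the controller's actual logic.

```python
class LockLookup:
    """Sketch of an 8-entry read-lock CAM plus a read-lock miss queue.
    Entry count matches the text; interfaces are invented."""
    def __init__(self, entries=8):
        self.entries = entries
        self.locked = {}          # address -> valid bit
        self.miss_queue = []      # read lock requests that missed

    def read_lock(self, address):
        """Return True if the lock was granted, else queue the request."""
        if self.locked.get(address):
            self.miss_queue.append(address)   # held: park in miss queue
            return False
        if len(self.locked) >= self.entries:
            raise OverflowError("lock CAM full")
        self.locked[address] = True           # set valid bit
        return True

    def unlock(self, address):
        """Clear the valid bit, then replay queued read locks at high
        priority; return the addresses granted on replay."""
        self.locked.pop(address, None)
        replay, self.miss_queue = self.miss_queue, []
        return [a for a in replay if self.read_lock(a)]
```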
Referring to FIG. 5A, typical timing of a static random access memory without active memory optimization and with active memory optimization is shown. As can be seen, grouping reads and writes improves cycle time by eliminating dead cycles.
Referring to FIG. 6, communication between the microengines 22 and the FBUS interface logic (FBI) is shown. The FBUS interface 28 in a network application can perform header processing of incoming data packets from the FBUS 18. A key function that the FBUS interface performs is extraction of packet headers and a microprogrammable source/destination/protocol hashed lookup in SRAM. If the hash does not successfully resolve, the packet header is promoted to the core processor 20 for more sophisticated processing.
FBI 28 includes transmit FIFO 182, receive FIFO 183, hash unit 188, and FBI control and status register 189. These four units communicate with the microengine 22 via time multiplexed access to the SRAM bus 38 (connected to the transfer registers 78 and 80 in the microengine). That is, all communication to and from the microengines is via transfer registers 78 and 80. The FBUS interface 28 includes a push state machine 200 for pushing data into the transfer registers during various time periods in which the SRAM is not using the SRAM data bus (part of the bus 38) and a pull state machine 202 for pulling data from the transfer registers in the respective microengines.
The hash unit includes a pair of FIFOs 188a, 188b. The hash unit determines that the FBI 28 received an FBI_hash request. The hash unit 188 fetches hash keys from the calling microengine 22. After the keys are fetched and hashed, the indices are delivered back to the calling microengine 22. Up to three hashes are performed under a single FBI_hash request. The buses 34 and 38 are each unidirectional: SDRAM_push/pull_data, and Sbus_push/pull_data. Each of these buses requires control signals which will provide read/write control to the appropriate microengine 22 transfer registers.
In general, the transfer registers must be protected from the context that controls them to guarantee read correctness. In particular, if a write transfer register is being used by thread_1 to provide data to the SDRAM 16a, thread_1 must not overwrite this register until the signal back from SDRAM controller 26a indicates that the register has been promoted and can now be reused. Not every write requires a signal back from the destination indicating that the operation has completed, because if the thread writes to the same command queue at the destination with multiple requests, completion order is guaranteed within that command queue, so only the last command needs to signal back to the thread. However, if the thread uses multiple command queues (order and read), then these command requests must be broken into separate context tasks so that ordering is maintained via context swapping. The exception noted at the beginning of this paragraph relates to certain classes of operations that use an unsolicited push of transfer registers from the FBI in order to deliver FBUS status information. To protect read/write determinism on the transfer registers, the FBI provides a special Push_protect signal when it sets up these special FBI push operations.
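The "only the last command signals back" rule can be sketched as a small tagging helper. This is a simplified model of the convention described above; the function and tuple layout are invented for illustration.

```python
def issue_writes(queue_name, commands):
    """Tag each command destined for one in-order command queue with
    whether it must signal completion back to the issuing thread.

    Because completion order is guaranteed within a single command queue,
    only the final command needs the signal: when it completes, all
    earlier commands in that queue are known to have completed too.
    """
    return [(queue_name, cmd, i == len(commands) - 1)
            for i, cmd in enumerate(commands)]

# Three writes to the same queue: only the last carries the signal-back.
print(issue_writes("order", ["w0", "w1", "w2"]))
```

When a thread spreads requests across multiple queues, this shortcut no longer holds, which is why the text above requires separate context tasks in that case.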
Any microengine 22 that uses the FBI unsolicited-push technique must test the protection flag before accessing the FBUS-interface/microengine agreed-upon transfer registers. If the flag is not set, the transfer registers may be accessed by the microengine. If the flag is set, the context should wait N cycles before accessing the registers. This count is determined a priori by the number of transfer registers being pushed, plus a front-end protection window. The basic idea is that the microengine must test this flag and then quickly move the data it wishes to read from the read transfer registers to GPRs in contiguous cycles, so the push engine does not collide with the microengine read.
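The a priori wait computation can be sketched as follows. The formula (register count plus a fixed protection window) paraphrases the paragraph above; the default window size is an assumption for illustration.

```python
def protect_wait_cycles(push_protect_set, regs_being_pushed, protect_window=2):
    """Cycles a context must wait before touching its transfer registers.

    push_protect_set:  the Push_protect flag sampled by the microengine.
    regs_being_pushed: number of transfer registers in the FBI push.
    protect_window:    front-end protection window (assumed default).

    Returns 0 when the flag is clear (registers may be read immediately);
    otherwise the a priori wait described in the text.
    """
    if not push_protect_set:
        return 0
    return regs_being_pushed + protect_window

print(protect_wait_cycles(False, 4))  # → 0
print(protect_wait_cycles(True, 4))   # → 6
```

After the wait expires, the microengine should copy the pushed data into GPRs in contiguous cycles, as described above, so a subsequent push cannot collide with the read.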
Other embodiments
It is to be understood that while the invention has been described in detail, the foregoing description is intended to illustrate and not to limit the scope of the invention.
Claims (17)
1. A controller for a random access memory, the controller comprising:
an address and command queue module for holding memory references from a plurality of microcontrol functional units, the address and command queue module comprising:
a read queue module;
a first read/write queue module to hold memory references from a core processor; and
a control logic module including an arbiter that detects a fullness of each queue module and a completion status of outstanding memory references to select a memory reference from one of the queue modules.
2. The controller of claim 1, wherein the control logic module selects a memory reference from one of said queue modules to provide as a next memory reference based on a priority of the memory reference represented by a programmable value stored in a priority service control register.
3. The controller of claim 1, wherein the address and command queue module further comprises:
a high priority queue holding memory references from high priority tasks.
4. The controller of claim 1, wherein the microcontrol functional units sort the memory references into read memory references and write memory references.
5. The controller of claim 1, wherein the address and command queue module further comprises:
an order queue module that holds write memory references, wherein the controller examines incoming memory reference requests and sorts the incoming memory reference requests into the read queue or the order queue based on a specified address mode.
6. The controller of claim 1, wherein the address and command queue module further comprises:
an order queue; and
wherein if the memory reference request has no memory optimization bit set, the memory reference request is stored in the order queue.
7. The controller of claim 1, wherein the address and command queue module is implemented in a single memory structure, and further comprising:
an order queue module for holding memory references;
a read queue module to hold memory references;
a high priority queue module for holding memory references;
a read lock miss queue module to hold read lock memory reference requests that miss due to a pre-existing lock on a portion of random access memory controlled by the controller; and
wherein the memory structure is partitioned into four different queue regions, each with its own head and tail pointers.
8. The controller of claim 7, wherein the address and command queue module further comprises:
an insert queue control and remove queue arbitration logic module to control insertion of memory references into, and removal of memory references from, the queues, respectively.
9. The controller of claim 1, further comprising:
a command decoder and address generator that generates addresses and commands to control a memory interface, in response to an address of a memory reference selected from one of the queue modules.
10. The controller of claim 1, further comprising:
a memory interface responsive to the generated address and command to generate memory control signals.
11. The controller of claim 9, wherein the controller further comprises:
a lock lookup content addressable memory for lookup of read locks.
12. The controller of claim 10, wherein the address and command queue module further comprises:
a read lock miss queue module that holds read lock memory reference requests that miss due to a pre-existing lock on a portion of random access memory controlled by the controller.
13. The controller of claim 12, wherein, if one of the microcontrol functional units issues a read lock request, the command decoder responds by accessing the lock lookup memory to determine whether the memory location specified in the read lock request is already locked.
14. The controller of claim 13, wherein, if the memory location is locked from any previous read lock request, the issued read lock request fails and is stored in the read lock miss queue module.
15. The controller of claim 14, wherein, if the memory location is not locked, the memory interface translates the issued read reference into an address signal for the memory.
16. The controller of claim 15, wherein the command decoder and address generator inputs a lock into the lock lookup memory corresponding to the memory address of the issued read reference.
17. The controller of claim 1, wherein the controller is configured to control Static Random Access Memory (SRAM).
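The arbiter recited in claims 1 and 2 can be illustrated with a small behavioral model. The concrete selection policy below (priority weight from the control register, ties broken by queue fullness, issue gated by an outstanding-reference budget) is an assumption for illustration; the claims only require that queue fullness, the completion status of outstanding references, and a programmable priority value be consulted.

```python
def arbitrate(queues, priority_csr, outstanding_budget):
    """Pick the queue to issue the next memory reference from.

    queues:             {queue_name: list of pending references}.
    priority_csr:       {queue_name: programmed priority weight} -- models
                        the priority service control register of claim 2.
    outstanding_budget: how many more references may be in flight; models
                        the completion status of outstanding references.
    Returns the chosen queue name, or None if nothing can issue.
    """
    if outstanding_budget <= 0:
        return None  # too many outstanding references; stall
    candidates = [q for q, refs in queues.items() if refs]
    if not candidates:
        return None  # all queues empty
    # Prefer the highest programmed priority; break ties by fullness.
    return max(candidates,
               key=lambda q: (priority_csr.get(q, 0), len(queues[q])))

print(arbitrate({"read": [1, 2], "order": [1]}, {"read": 1, "order": 2}, 4))
```

With equal programmed priorities the fuller queue wins, which keeps any one queue from backing up; raising a queue's value in the priority register lets software steer service toward it.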
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US09/387,110 | 1999-08-31 | ||
| US09/387,110 US6427196B1 (en) | 1999-08-31 | 1999-08-31 | SRAM controller for parallel processor architecture including address and command queue and arbiter |
| PCT/US2000/022653 WO2001016769A1 (en) | 1999-08-31 | 2000-08-17 | Sram controller for parallel processor architecture |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1053728A1 HK1053728A1 (en) | 2003-10-31 |
| HK1053728B true HK1053728B (en) | 2005-10-07 |