
HK1195958A - Multi-core interconnect in a network processor - Google Patents

Multi-core interconnect in a network processor

Info

Publication number
HK1195958A
HK1195958A (application HK14109343.9A)
Authority
HK
Hong Kong
Prior art keywords
cache
requests
processor cores
data
interconnect
Prior art date
Application number
HK14109343.9A
Other languages
Chinese (zh)
Other versions
HK1195958B (en)
Inventor
R. E. Kessler
D. H. Asher
J. M. Perveiler
B. D. Dobbie
Original Assignee
Marvell Asia Pte. Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Marvell Asia Pte. Ltd.
Publication of HK1195958A publication Critical patent/HK1195958A/en
Publication of HK1195958B publication Critical patent/HK1195958B/en


Description

Multi-core interconnect in a network processor
RELATED APPLICATIONS
This application is a continuation of and claims priority to U.S. Application No. 13/285,629, filed October 31, 2011, the entire teachings of which are incorporated herein by reference.
Background
Typical network processors schedule and queue work, such as packet processing operations, for upper-level network protocols, and allow processing with respect to upper-level network protocols (e.g., transport and application layers) in received packets before forwarding the packets to a connected device. Functions typically performed by a network processor include packet filtering, queue management and prioritization, quality of service enforcement, and access control. By employing features specific to processing packet data, a network processor can optimize the interface of a networked device.
Disclosure of Invention
Embodiments of the present invention provide a system for controlling data transfer and processing in a network processor. Interconnect circuitry directs communication between a set of multiple processor cores and a cache. A plurality of memory buses each connect a corresponding set of the plurality of processor cores to the interconnect circuitry. The cache is divided into a plurality of banks, where each bank is connected to the interconnect circuitry by a separate bus. The interconnect circuitry distributes requests received from the plurality of processor cores among the cache banks.
In further embodiments, the interconnect circuitry may translate the requests by modifying an address portion of the requests. This translation may include performing a hash function on each request, the hash function providing a pseudo-random distribution of the requests among the plurality of banks. The interconnect circuitry or the cache banks may be further configured to maintain tags indicating the state of an L1 cache coupled to the plurality of processor cores. The interconnect circuitry may direct tags within the received requests to multiple channels, thereby processing multiple tags simultaneously.
In yet further embodiments, the interconnect circuitry may include a plurality of data output buffers. Each of the data output buffers may receive data from each of the plurality of banks and output data through a corresponding one of the memory buses. The interconnect circuitry may also include a plurality of request buffers, where each of the request buffers receives requests from each set of processor cores and outputs the requests to a corresponding one of the banks.
In further embodiments, one or more bridge circuits may be coupled to the memory buses. The bridge circuits may connect the processor cores to one or more on-chip coprocessors. Further, to maintain memory coherency, the cache banks may delay transmission of a commit signal to the plurality of processor cores; the cache banks then transmit the commit signal in response to receiving an indication that the corresponding invalidate signals have been transmitted to all of the plurality of processor cores. The interconnect circuitry and memory buses may be configured to control the invalidate signals to arrive at an L1 cache in less time than the time required for a commit to reach one of the plurality of banks and for a subsequent signal to arrive at one of the plurality of processor cores receiving the invalidate signal.
Drawings
The foregoing will be apparent from the following more particular description of exemplary embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the invention.
FIG. 1 is a block diagram illustrating a network services processor in which embodiments of the present invention may be implemented.
FIG. 2 is a block diagram of a Coherent Memory Interconnect (CMI) circuit and associated components, in one embodiment.
FIG. 3 is a block diagram that illustrates the processing of requests to the cache banks, in one embodiment.
FIG. 4 is a block diagram illustrating buffers implemented in store requests to the cache banks.
FIG. 5 is a block diagram illustrating buffers implemented in data output by the cache banks.
FIG. 6 is a block diagram of a cache bank in one embodiment.
Detailed Description
Before describing in detail exemplary embodiments of the present invention, an exemplary network security processor in which these embodiments may be implemented is described immediately below to assist the reader in understanding the inventive features of the present invention.
Fig. 1 is a block diagram illustrating a network services processor 100. The network services processor 100 provides high application performance using at least one processor core 120.
The network services processor 100 handles the Open Systems Interconnection network L2-L7 layer protocols encapsulated in received packets. As is well known to those skilled in the art, the Open Systems Interconnection (OSI) reference model defines seven network protocol layers (L1-L7). The physical layer (L1) represents the actual interface that connects a device to a transmission medium, including electrical and physical interfaces. The data link layer (L2) performs data framing. The network layer (L3) formats the data into packets. The transport layer (L4) handles end-to-end transport. The session layer (L5) manages communication between devices, for example, whether the communication is half-duplex or full-duplex. The presentation layer (L6) manages data formatting and presentation, such as syntax, control codes, special graphics, and character sets. The application layer (L7) allows communication between users, such as file transfers and e-mail.
The network services processor 100 may schedule and queue work (packet processing operations) for upper-level network protocols (e.g., L4-L7) and allows processing of the upper-level network protocols in received packets to be performed so that the packets are forwarded at wire speed. Wire speed is the data transfer rate of the network over which data is transmitted and received. By processing these protocols to forward the packets at wire speed, the network services processor does not slow down the network data transfer rate.
A plurality of interface units 122 receive packets for processing. A PCI interface 124 may also receive data packets. The interface units 122 perform preprocessing of each received packet by checking various fields in the L2 network protocol header included in the received packet, and then forward the packet to a packet input unit 126. At least one interface unit 122a may receive data packets from multiple X Attachment Unit Interfaces (XAUI), Reduced X Attachment Unit Interfaces (RXAUI), or Serial Gigabit Media Independent Interfaces (SGMII). At least one interface unit 122b may receive connections from an Interlaken Interface (ILK).
The packet input unit 126 performs further pre-processing of network protocol headers (e.g., L3 and L4 headers) included in the received packet. This preprocessing includes checksum checking for TCP/User Datagram Protocol (UDP) (L4 network protocols).
A free pool allocator 128 maintains pools of pointers to free memory in the level 2 cache memory 130 and the external DRAM 108. The packet input unit 126 uses one of the pools of pointers to store received packet data in the level 2 cache memory 130 or external DRAM 108 and uses another of the pools of pointers to allocate work queue entries for the processor cores 120.
The packet input unit 126 then writes the packet data into buffers in the level 2 cache memory 130 or the external DRAM 108. Preferably, the packet data is written into the buffers in a format that is convenient for higher-level software executing in at least one of the processor cores 120. Thus, further processing of the higher-level network protocols is facilitated.
Network services processor 100 may also include one or more application specific coprocessors. When included, the coprocessors offload some of the processing from the cores 120, thereby enabling the network services processor to achieve high throughput packet processing. For example, a compression/decompression co-processor 132 is provided that is dedicated to performing compression and decompression of received data packets. Other embodiments of the co-processing unit include a RAID/De-Dup unit 162, which speeds up the data chunking and data copying process for disk storage applications.
Another coprocessor is a Hyper Finite Automata (HFA) unit 160 that includes specialized HFA thread engines adapted to expedite the pattern and/or signature matching necessary for anti-virus, intrusion detection systems, and other content processing applications. Using the HFA unit 160, pattern and/or signature matching is accelerated, for example, to rates exceeding multiples of tens of gigabits per second. The HFA unit 160 may, in some embodiments, include any of a Deterministic Finite Automata (DFA), Non-deterministic Finite Automata (NFA), or HFA algorithm unit.
An I/O interface 136 manages the overall protocol and arbitration and provides coherent I/O partitioning. The I/O interface 136 includes an I/O bridge 138 and a fetch-and-add unit 140. The I/O bridge includes two bridges, an I/O Packet Bridge (IOBP) 138a and an I/O Bus Bridge (IOBN) 138b. The I/O packet bridge 138a is configured to manage the overall protocol and arbitration and provide coherent I/O partitioning primarily for packet input and output. The I/O bus bridge 138b is configured to manage the overall protocol and arbitration and provide coherent I/O partitioning primarily for the I/O bus. Registers in the fetch-and-add unit 140 are used to maintain the lengths of the output queues used for forwarding processed packets through a packet output unit 146. The I/O bridge 138 includes buffer queues for storing information to be transferred between a Coherent Memory Interconnect (CMI) 144, the I/O bus 142, the packet input unit 126, and the packet output unit 146.
The miscellaneous I/O interfaces (MIO) 116 may include a number of auxiliary interfaces such as general purpose I/O (GPIO), flash memory, IEEE 802 two-wire management interface (MDIO), serial management interrupt (SMI), universal asynchronous receiver/transmitter (UART), Reduced Gigabit Media Independent Interface (RGMII), Media Independent Interface (MII), two-wire serial interface (TWSI), and others.
The network services processor 100 may also include a Joint Test Action Group (JTAG) interface 123 that supports the MIPS EJTAG standard. According to the JTAG and MIPS EJTAG standards, the cores within the network services processor 100 each have an internal Test Access Port ("TAP") controller. This allows multi-core debug support for the network services processor 100.
A Schedule/Synchronize and Order (SSO) module 148 queues and schedules work for the processor cores 120. Work is queued by adding a work queue entry to a queue. For example, the packet input unit 126 adds a work queue entry for each packet arrival. A timer unit 150 is used to schedule work for the processor cores 120.
Processor cores 120 request work from the SSO module 148. The SSO module 148 selects (i.e., schedules) work for one of the processor cores 120 and returns a pointer to the work queue entry describing the work to the processor core 120.
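To make the queuing model concrete, the following is a minimal software sketch of the described flow, in which the packet input adds a work queue entry per arriving packet and a core later asks the scheduler for work; the structure fields, sizes, and function names are illustrative assumptions, not the actual SSO hardware interface.
```c
#include <stddef.h>

typedef struct work_queue_entry {
    void    *packet;   /* pointer to packet data in L2 cache or DRAM */
    unsigned group;    /* scheduling group (assumed field) */
    unsigned tag;      /* ordering tag (assumed field) */
} work_queue_entry_t;

typedef struct {
    work_queue_entry_t *entries[64];
    size_t head, tail;              /* head == tail means empty */
} work_queue_t;

/* Packet input adds a work queue entry for each arriving packet
 * (no overflow check in this sketch). */
void sso_add_work(work_queue_t *q, work_queue_entry_t *wqe)
{
    q->entries[q->tail++ % 64] = wqe;
}

/* A core requests work; the scheduler returns a pointer to the next
 * work queue entry, or NULL when no work is available. */
work_queue_entry_t *sso_get_work(work_queue_t *q)
{
    if (q->head == q->tail)
        return NULL;
    return q->entries[q->head++ % 64];
}
```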
Each processor core 120, in turn, includes an instruction cache 152, a level 1 data cache 154, and an encryption accelerator 156. In one embodiment, the network services processor 100 includes 32 superscalar Reduced Instruction Set Computer (RISC)-type processor cores 120. In some embodiments, each of these superscalar RISC-type processor cores 120 comprises an extension of the MIPS64 version 3 processor core. In one embodiment, each of these superscalar RISC-type processor cores 120 includes a cnMIPS II processor core.
The level 2 cache memory 130 and external DRAM 108 are shared by all processor cores 120 and I/O coprocessor devices. Each processor core 120 is coupled to the level 2 cache memory 130 by the CMI 144. The CMI 144 is the communication channel for all memory and I/O transactions between the processor cores 120, the I/O interface 136, and the level 2 cache memory 130 and its controller. In one embodiment, the CMI 144 scales to 32 processor cores 120, supporting fully coherent level 1 data caches 154 with write-through. Preferably, the CMI 144 is highly buffered, with the ability to prioritize I/O. The CMI is coupled to a trace control unit 164 configured to capture bus requests, so that software can later read the requests and generate a trace of the sequence of events on the CMI.
The level 2 cache memory controller 131 maintains memory reference coherency. It returns the latest copy of a block for every fill request, whether the block is stored in the level 2 cache memory 130, in the external DRAM 108, or is "in flight". It also stores a duplicate copy of the tags of the data cache 154 in each processor core 120. It compares the addresses of cache-block-store requests against the data-cache tags and invalidates (both copies of) a data-cache tag for a processor core 120 whenever the store instruction is sourced by another processor core or by an I/O component via the I/O interface 136.
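The sketch below illustrates the duplicate-tag idea described above, assuming for simplicity a direct-mapped L1 with 128-byte blocks; the names, geometry, and data structures are assumptions for illustration, not the actual L2C design.
```c
#define NUM_CORES 32
#define L1_SETS   128                     /* assumed L1 geometry */

typedef struct {
    unsigned long tag;
    int           valid;
} dut_entry_t;

/* Duplicate copy of each core's L1 data-cache tags. */
static dut_entry_t dut[NUM_CORES][L1_SETS];

/* On a store sourced by src_core (or an I/O device), invalidate the
 * duplicate tag and notify any other core holding the block. */
void l2c_store_snoop(unsigned long block_addr, int src_core)
{
    unsigned long block = block_addr >> 7;        /* 128-byte cache blocks */
    unsigned set = (unsigned)(block % L1_SETS);

    for (int core = 0; core < NUM_CORES; core++) {
        if (core == src_core)
            continue;
        if (dut[core][set].valid && dut[core][set].tag == block) {
            dut[core][set].valid = 0;             /* invalidate duplicate tag */
            /* ...and send an L1D invalidate to that core over its bus */
        }
    }
}
```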
In some embodiments, multiple DRAM controllers 133 support up to 128 gigabytes of DRAM. In one embodiment, the plurality of DRAM controllers includes four DRAM controllers each supporting 32 gigabytes of DRAM. Preferably, each DRAM controller 133 supports a 64-bit interface to DRAM 108. In addition, DRAM controller 133 may support a preferred protocol, such as the DDR-III protocol.
After a packet has been processed by the processor cores 120, the packet output unit 146 reads the packet data from the level 2 cache memory 130 or DRAM 108, performs L4 network protocol post-processing (e.g., generates a TCP/UDP checksum), forwards the packet through the interface units 122 or the PCI interface 124, and frees the L2 cache memory 130/DRAM 108 used by the packet.
These DRAM controllers 133 manage in-flight transactions (loads/stores) to/from the DRAM 108. In some embodiments, the DRAM controllers 133 comprise four DRAM controllers, the DRAM 108 comprises four DRAM memories, and each DRAM controller is connected to one DRAM memory. The HFA unit 160 is coupled directly to the DRAM controllers 133 over a bypass cache access path 135. The bypass cache access path 135 allows the HFA unit to read directly from memory without using the level 2 cache memory 130, which can improve the efficiency of HFA operations.
Embodiments of the invention may be implemented in the network services processor 100 shown in FIG. 1, and may be directed more particularly to the packet input unit 126 and the interface units 122. Example embodiments are described in further detail below with reference to FIGS. 2-6.
FIG. 2 is a block diagram of a Coherent Memory Interconnect (CMI) circuit 244 and associated components, in one embodiment. CMI 244 is the communication channel and control circuitry for directing memory and I/O transactions between the sets of processor cores 220A-D, the I/O bridges 238A-B, and the level 2 cache banks 230A-D. CMI 244 may be implemented within network processor 100 as CMI 144, with processor cores 220A-D implemented as processor cores 120, I/O bridges 238A-B implemented as I/O bridges 138a-b, and level 2 cache banks 230A-D implemented as level 2 cache 130.
As the number of processor cores implemented in a network processor increases, providing controlled access to the memory subsystem from such a large number of sources becomes problematic. A first challenge in network processors having a large number of processor cores (e.g., 32) is how to transfer requests from the cores to the memory system; previous designs used a ring bus, which can produce higher (and variable) latency. A second challenge in designing multi-core chips is servicing the large number of requests generated by that large number of cores. A third, related challenge involves the structure that holds the L1 tags of the processor cores (hereinafter referred to as the duplicate tags, or DUT), which must accommodate the requirement of each request to look up and possibly update the DUT. Fourth, response data must be transferred from the cache back onto the FILL buses; the FILL buses may not be fully utilized when each cache bank can service only one bus request at a time and each request requires up to 4 cycles. A fifth challenge relates to the data associated with bus store requests, which must be transferred from the source of the request to the cache bank that will service it; this problem mirrors the fourth challenge (involving response data), but with source and destination reversed. Sixth, the processor cores need to access devices on the other side of the I/O bridges. Finally, a seventh challenge involves maintaining memory coherency throughout the memory subsystem.
Embodiments of the present invention provide for processing transactions between the multiple processor cores and the L2 cache and memory subsystem through four Coherent Memory Bus (CMB) 225A-D groups. Each CMB 225A-D includes separate ADD/STORE/COMMIT/FILL buses. The entire group of four CMBs 225A-D and the I/O bridge buses IOC/IOR are connected together by the Coherent Memory Interconnect (CMI) 244. Likewise, four additional CMBs 235A-D include separate ADD/STORE/COMMIT/FILL buses and connect the cache banks 230A-D to the CMI 244.
Each of these CMBs 225A-D may support a corresponding set of processor cores 220A-D. In the present example embodiment, each set of processor cores 220A-D includes 8 processor cores, but may be modified to include more or fewer cores. To provide memory access to the I/O portion of the network processor, two of the buses 225A-B each have an I/O bridge (IOB) 238A-B attached. IOB0 238A may be used to provide the processor cores 220A-D with access to NCB-side I/O devices over dedicated I/O command (IOC) and I/O response (IOR) buses. Both IOB0 238A and IOB1 238B may access the L2 cache and memory subsystem by sharing the CMB buses 225A-B with the processor cores 220A-B, respectively.
Each of the cache banks 230A-D may include a level 2 cache controller (L2C) that controls the transfer of commands and responses between the CMBs 225A-D and the cache banks 230A-D while maintaining the shared memory coherency model of the system. This L2C is described in further detail below with reference to FIG. 6.
By grouping the processor cores 220A-D and I/O bridges 238A-B into four groups, each group being serviced by a single CMB 225A-D, lower-latency arbitration logic can be used. Local arbitration decisions are made only among one set of processor cores 220A-D (and an I/O bridge 238A-B, in the case of a CMB with an attached IO bridge) as sources covering a much smaller physical area. Arbitration requests and grants of those requests may be accomplished in a single cycle, a rate that would be unachievable if an attempt were made to arbitrate among all of the processor cores and I/O bridges of the network processor. In addition, all CMBs 225A-D may be connected to the request buffers in the interconnect circuit with the same low, fixed delay. As a result, requests are transferred from the cores to the memory system with low latency.
To service the large number of requests generated by the large number of processor cores, the L2 cache is divided into four separate cache banks 230A-D. As a result, the bandwidth of requests that can be serviced is quadrupled. The physical address of each request may be hashed using an exclusive-OR (XOR) function configured to produce a near-random distribution of cache blocks across the 4 cache banks for all common address strides. This translates the spatial locality of the CMB requests into a near-random distribution across the four cache banks 230A-D, allowing the four tag lookups per cycle to be better utilized. Conversely, if the L2 cache were instead a single unified structure, only one tag lookup could occur per cycle, severely limiting L2 cache bandwidth. As a result, the network processor can service the large number of requests generated by the large number of cores. One example configuration of request processing and routing is described below with reference to FIG. 3.
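As a rough illustration of this kind of bank-selection hash, the sketch below XOR-folds higher physical-address bits onto two bank-index bits so that common power-of-two strides spread across the four banks; the exact bit positions and block size are assumptions for illustration, not the processor's actual hash function.
```c
/* Select one of 4 cache banks from a physical address. */
static inline unsigned cache_bank_select(unsigned long paddr)
{
    unsigned long idx = paddr >> 7;   /* 128-byte cache blocks (assumed) */
    /* XOR-fold several higher index fields onto the low 2 bits */
    return (unsigned)(idx ^ (idx >> 2) ^ (idx >> 4) ^
                      (idx >> 6) ^ (idx >> 8)) & 0x3u;   /* 0..3 */
}
```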
In processing data requests, CMI 244 must look up and possibly update the DUT. With 4 buses feeding requests, this process requires up to 4 DUT lookups per cycle. To accommodate multiple DUT lookups, the DUT may be divided into 8 sets (also referred to as "channels"), each of which can perform one lookup per cycle. Interconnect circuitry 244 can slot up to 4 CMB requests into the DUT per cycle, provided they require different sets. This configuration provides a 2:1 ratio of resources to requests, increasing the chance that multiple requests can be serviced in the same cycle. As a result, the L1 tag structure of the network processor can accommodate the requirement of each request to look up and possibly update the DUT. The process of updating the DUT is described in further detail below with reference to FIG. 3.
Requests to store or retrieve data are transmitted from one of the sets of processor cores 220A-D, through the corresponding bus 225A-D and CMI 244, to the cache bank 230A-D that will service the request. To service requests from the multiple processor cores, the process of reading store data from the store sources (either I/O bridges 238A-B or processor cores 220A-D) may be decoupled from the process of writing into the store buffers of the cache banks 230A-D. This can be done using four 2-read/2-write-port custom data buffers. Each buffer may receive data from two of the CMB 225A-D buses and send data to two of the cache banks 230A-D. This configuration allows each CMB 225A-D STORE bus to provide a fixed amount (e.g., 128 bytes) of store data during each cycle and allows each cache bank to receive the same amount of store data (e.g., 128 bytes) per cycle, independent of which particular CMB 225A-D or cache bank 230A-D needs to provide or receive the data. This configuration decouples the arbitration for CMB store data requests from the arbitration for writes into the cache bank 230A-D buffers, thereby allowing full utilization of the available bus resources. As a result, the data input capacity of the cache banks is fully utilized. The configuration of the data buffers is described in detail below with reference to FIG. 4.
The cache banks provide responses to requests, and the responses must be transmitted from the cache banks back onto the CMB 235A-D FILL buses. Each cache bank 230A-D may only be able to service one CMB request at a time, and each request may require up to 4 cycles. To keep the CMB 235A-D FILL buses fully utilized, the cache bank 230A-D FILL buffer read ports may be decoupled from the CMB 235A-D FILL buses, and 3-write-port buffers, with a bypass around the buffers for data destined for the CMB 235A-D FILL buses, may be implemented in the interconnect circuitry 244. This allows up to 3 cache banks to read out response data and queue it for transmission to the same bus. These buffers coordinate the CMB 235A-D FILL buses and the cache bank 230A-D FILL ports to maximize utilization. As a result, the data output capacity of the cache banks is fully utilized. This configuration of the FILL buffers is described in further detail below with reference to FIG. 5.
In addition to the cache banks 230A-D, the processor cores 220A-D also need to access devices on the other side of the IOBs 238A-B (e.g., interface units 122a-b and other devices on the I/O bus 142 of network processor 100 in FIG. 1). This access is provided through the dedicated I/O command (IOC) bus of IOB0 238A. The CMB 235A ADD/STORE buses provide requests from the processor cores 220A-D, and interconnect circuitry 244 may translate those requests into the form required by the IOC bus. Furthermore, interconnect circuitry 244 must handle arbitration for this single IOC bus. When I/O bridge 238A provides response data, it places the data on the I/O response (IOR) bus. Interconnect circuitry 244 then receives this data, formats it appropriately, and returns it to the requesting core via the CMB 235A-D FILL buses. As a result, processor cores 220A-D are provided access to devices across the IOBs 238A-B.
In order to maintain memory subsystem coherency, the invalidate and commit signals generated by a store request must be handled carefully in view of the multi-bus architecture. When a processor core 220A-D or I/O device (via I/O bridges 238A-B) requests a store operation, it receives a commit signal from the L2C of the corresponding cache bank 230A-D indicating when the stored data is visible to the other cores and I/O devices. By waiting for all outstanding commit signals for its own stores, a requester can determine that its earlier stores will be visible before subsequent stores, which provides a mechanism to signal other cores and devices that they can proceed. Because other cores may be on different buses than the core/device that produced the store, there is an important ordering relationship between the commit signal and its associated L1 invalidate signals. If a signaled core receives that signal before it receives the corresponding invalidate, it can still see the old data, causing a loss of coherency in the memory system. The L2C of cache banks 230A-D may prevent this loss of coherency by refraining from transmitting the commit signal until it first confirms that the invalidate signals have been sent to the processor cores 220A-D across all buses 225A-D. In some embodiments, the circuitry may be configured to ensure that the invalidate signals arrive at the L1 caches in less time than the time required for the commit to reach the memory store source and for a subsequent signal to then arrive at a core receiving the invalidate. As a result, memory coherency is maintained.
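The ordering rule can be summarized with the following sketch, in which a cache bank holds back the commit for a store until the corresponding L1 invalidates have been issued on every core-side bus; the signal names and data structure are illustrative assumptions, not hardware signal definitions.
```c
#include <stdbool.h>

#define NUM_CMB 4                       /* four core-side buses */

typedef struct {
    bool invalidate_sent[NUM_CMB];      /* invalidate issued on each bus? */
    bool commit_sent;
} store_txn_t;

/* Send the commit only after every bus has its invalidate in flight,
 * so no core can observe the commit-dependent signal before its own
 * invalidate. */
void l2c_try_commit(store_txn_t *t, void (*send_commit)(store_txn_t *))
{
    for (int bus = 0; bus < NUM_CMB; bus++)
        if (!t->invalidate_sent[bus])
            return;                     /* hold the commit back */
    if (!t->commit_sent) {
        send_commit(t);                 /* now safe to make the store visible */
        t->commit_sent = true;
    }
}
```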
FIG. 3 is a block diagram that illustrates the processing of requests to the cache banks, in one embodiment. As described above with reference to FIG. 2, both the core-side CMBs 225A-D and the cache-side CMBs 235A-D include ADD buses that carry the address and control information initiating memory transactions. The sources of these transactions may be the processor cores 220A-D and the IOBs 238A-B. As shown in FIG. 3, CMI interconnect circuitry 244 is provided for processing and forwarding addresses from the core-side CMBs 225A-D to the cache-side CMBs 235A-D. Here, referring also to FIG. 2, the requested address is received on the ADD bus (ADD0-ADD3) of a CMB 225A-D and is directed to the FIFO buffer corresponding to one of the cache banks 230A-D (also referred to as "TADs" (TAD0-TAD3)). Any of the four ADD buses may direct a transaction to any of the cache banks 230A-D. Whichever ADD bus a transaction is initiated from, the address of the request selects the cache bank 230A-D that processes the transaction. The physical address of each request may be hashed using an exclusive-OR (XOR) function configured to produce a near-random distribution of cache blocks across the four cache banks 230A-D (TAD0-TAD3) for all common address strides. This translates the spatial locality of CMB requests into a near-random distribution across the four cache banks 230A-D (TAD0-TAD3), allowing the four tag lookups per cycle to be better utilized.
A transaction arriving on an ADD bus first enters one of the FIFOs of the destination cache bank. Each FIFO can buffer up to four ADD bus transactions per cycle. An arbitration algorithm determines the order in which the buffered addresses are processed further.
A scheduler at the CMI interconnect circuitry 244 determines which transactions may exit these FIFOs. Up to four transactions (one issued from each FIFO) compete for the L1D tag pipes in circuit 244 each cycle. The L1D tag pipes (shown as pipes 0-7) hold copies of the L1 data cache tags (i.e., they are the duplicate tags, or DUT). The L1D tag pipes determine whether a transaction must invalidate a copy of the block held in an L1 cache. If so, interconnect circuitry 244 will eventually send an L1D cache invalidate command for the transaction on the COMMIT and FILL buses.
When a request is received, interconnect circuitry 244 may analyze the address (or a portion of the address, such as address bits <9:7>) of the request to select which L1D tag pipe to use for each transaction. If the up to four transactions presented in a cycle all differ in address bits <9:7>, the interconnect schedules all of the transactions during that cycle. This implementation separates the DUT into DUT0-3, containing L1D tag pipes 0-3, and DUT4-7, containing L1D tag pipes 4-7.
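A simple behavioral model of this per-cycle selection is sketched below: each candidate transaction maps to one of the 8 DUT channels using address bits <9:7>, and up to four transactions are granted in a cycle provided no two need the same channel; this is a sketch of the idea only, not the hardware arbiter.
```c
#define DUT_CHANNELS 8

/* Address bits <9:7> select the DUT channel. */
static inline unsigned dut_channel(unsigned long paddr)
{
    return (unsigned)(paddr >> 7) & 0x7u;
}

/* Returns a bitmask of granted requests, at most one per DUT channel
 * and at most four grants (one per ADD bus) per cycle. */
unsigned schedule_dut(const unsigned long addr[], int nreq)
{
    unsigned granted = 0, busy = 0;
    for (int i = 0; i < nreq && i < 4; i++) {
        unsigned ch = dut_channel(addr[i]);
        if (!(busy & (1u << ch))) {        /* channel still free this cycle? */
            busy    |= 1u << ch;
            granted |= 1u << i;
        }
    }
    return granted;
}
```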
When interconnect circuitry 244 schedules a transaction from a cache bank FIFO, the transaction enters an L1D tag pipe within interconnect circuitry 244 and, at the same time (via the ADD bus), an L2 tag pipe within the cache bank. The state of the DUT is updated to match the state of the L1 tags. The in-flight buffer of the cache bank eventually completes the transaction using the results from the tag pipes. A bypass (not shown) can avoid any excess FIFO delay when there is no contention for the DUT and cache bank resources.
Each request may require a lookup of the L2 tags of the cache bank as well as of the state of the L1 tags of all processor cores. This L1 state is maintained in the DUT, a copy of the L1 tags held at the L2. The address hash function that evenly distributes requests across the cache banks cannot be used for the L1, because the bits needed by the hash function may not be available in time. To provide sufficient bandwidth to perform 4 DUT lookups per cycle, the DUT can be divided into 8 separate sets ("channels"), with each address mapping into one channel. Because only 4 addresses are selected per cycle against 8 DUT channels, it is likely that more than one request (up to a maximum of 4) can be selected under a typical address distribution.
FIG. 4 is a block diagram illustrating buffers implemented in store requests to the cache banks. As described above with reference to FIG. 2, the cache-side CMBs 235A-D include STORE buses that each carry data to be stored to the cache banks 230A-D during a memory transaction. Requests to store or retrieve data are transmitted from one of the sets of processor cores 220A-D, through the corresponding bus 225A-D and CMI 244, to the cache bank 230A-D that will service the request. To service requests from the multiple processor cores, four 2-read/2-write-port data buffers 422A-D receive data from the STORE buses (STORE0-STORE3). Each buffer 422A-D may receive data from two of the CMB 225A-D STORE buses and may send data to two of the cache banks 230A-D. This configuration allows each CMB 225A-D STORE bus to provide a fixed amount (e.g., 128 bytes) of store data during each cycle, and allows each cache bank to receive the same amount of store data (e.g., 128 bytes) per cycle, independent of which particular CMB 225A-D or cache bank 230A-D needs to provide or receive the data.
FIG. 5 is a block diagram illustrating buffers implemented in data output by the cache banks. As described above with reference to FIG. 2, a cache bank provides responses to requests, and the responses must be transmitted from the cache bank back onto the CMB 235A-D FILL buses (shown in FIG. 5 as TAD0 FILL ... TAD3 FILL). Each cache bank 230A-D may only be able to service one CMB request at a time, and each request may require up to 4 cycles. To keep the CMB 235A-D FILL buses fully utilized, the FILL buffers 532A-D may be implemented to decouple the cache bank 230A-D read ports from the CMB 235A-D FILL buses. The FILL buffers 532A-D may have 3 write ports and be implemented in interconnect circuitry 244, with a bypass around the buffers for data destined for the CMB 235A-D FILL buses. This allows up to 3 cache banks to read out response data and queue it for transmission to the same bus. Buffers 532A-D are provided to coordinate the CMB 235A-D FILL buses and the cache bank 230A-D FILL ports so as to maximize utilization of each of the CMB 235A-D FILL buses.
FIG. 6 is a block diagram of the L2C control circuitry present in each of the cache banks 230A-D described above with reference to FIG. 2. Each cache bank contains both the cache tags and the data for its portion of the L2 cache. Four quad groups contain the data, each quad group holding 256 KB of the L2 cache. Each cache bank also includes a number of address and data buffers. These include an in-flight address buffer (LFB) that tracks all received L2 read and write operations, and a victim address buffer (VAB) that tracks all blocks written to DRAM (through the LMC). The L2C holds and processes up to 16 simultaneous L2/DRAM transactions in its LFB, and also manages up to 16 in-flight L2 cache victim/write-through operations in the VAB/VBF.
The data buffers include: Fill Buffers (FBF), used whenever data is read from the L2 cache or DRAM; Store Buffers (SBF), used for all STORE transactions; and Victim Data Buffers (VDB), used for writing data to the DRAM. For an L2/DRAM fill transaction, the L2C returns the data from either the L2 cache or memory through the FBF entry associated with the LFB entry. For an L2/DRAM STORE transaction, the L2C first deposits the STORE bus data into the SBF entry associated with the LFB entry, and then either updates the cache or writes a full cache block store directly to the DRAM. All L2/DRAM transactions that miss in the L2 cache require a DRAM fill operation, except for store operations that store to all bytes in the cache block. Partial cache block store operations require a DRAM fill operation to obtain the bytes that are not stored. The L2C puts the DRAM fill data into the FBF, then writes it into the L2 cache (if needed), and forwards it onto the FILL bus (if needed).
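The fill-on-miss rule for stores can be illustrated as follows, assuming the stored bytes are tracked with a per-block byte mask (an assumption for illustration): a missing block needs a DRAM fill only when the store does not cover every byte of the 128-byte block.
```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BLOCK_BYTES 128

/* Return true when an L2-miss store needs a DRAM fill, i.e. when at
 * least one byte of the block is not covered by the store data. */
bool store_miss_needs_dram_fill(const uint8_t byte_mask[CACHE_BLOCK_BYTES])
{
    for (int i = 0; i < CACHE_BLOCK_BYTES; i++)
        if (!byte_mask[i])
            return true;    /* some bytes not stored: fetch them from DRAM */
    return false;           /* full-block store: no fill required */
}
```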
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

Claims (9)

1. A computer system on a computer chip, comprising:
an interconnect circuit;
a plurality of memory buses, each bus connecting a corresponding set of the plurality of processor cores to the interconnect circuit; and
a cache divided into a plurality of stripes, wherein each stripe is connected to the interconnect circuit by a separate bus;
the interconnect circuitry is configured to distribute a plurality of requests received from the plurality of processor cores among the plurality of stripes.
2. The system of claim 1, wherein the interconnect circuitry translates the requests by modifying an address portion of the requests.
3. The system of claim 2, wherein the interconnect circuitry performs a hash function on each of the requests, the hash function providing a pseudo-random distribution of the requests among the plurality of stripes.
4. The system as recited in claim 1, wherein the interconnect circuitry is configured to maintain tags indicating a state of an L1 cache coupled to one of the plurality of processor cores, and wherein the interconnect circuitry is further configured to direct tags in the plurality of requests to a plurality of channels, thereby processing the corresponding tags simultaneously.
5. The system of claim 1, wherein the interconnect circuit further comprises a plurality of data output buffers, each of the data output buffers configured to receive data from each of the plurality of stripes and output data over a corresponding one of the plurality of memory buses.
6. The system of claim 1, wherein the interconnect circuit further comprises a plurality of request buffers, each of the request buffers receiving requests from each set of the plurality of processor cores and outputting the requests to a corresponding one of the plurality of stripes.
7. The system of claim 1, further comprising at least one bridge circuit coupled to at least one of the memory buses, the at least one bridge circuit connecting the plurality of processor cores to at least one on-chip coprocessor.
8. The system as recited in claim 1, wherein the stripes are configured to delay transmission of a commit signal to the plurality of processor cores, the stripes transmitting the commit signal in response to receiving an indication that invalidate signals have been transmitted to all of the plurality of processor cores.
9. The system of claim 1, wherein the interconnect circuitry and the plurality of memory buses are configured to control a plurality of invalidate signals to an L1 cache in less time than required for a commit to one of the plurality of stripes and to control a subsequent signal to one of the plurality of processor cores receiving the invalidation.
HK14109343.9A 2011-10-31 2012-10-29 Multi-core interconnect in a network processor HK1195958B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/285,629 2011-10-31

Publications (2)

Publication Number Publication Date
HK1195958A (en) 2014-11-28
HK1195958B (en) 2021-01-29


Similar Documents

Publication Publication Date Title
JP6676027B2 (en) Multi-core interconnection in network processors
US12360843B2 (en) Multicore shared cache operation engine
US8949500B2 (en) Non-blocking processor bus bridge for network processors or the like
US8489794B2 (en) Processor bus bridge for network processors or the like
US9444757B2 (en) Dynamic configuration of processing modules in a network communications processor architecture
US9569366B2 (en) System and method to provide non-coherent access to a coherent memory system
US9218290B2 (en) Data caching in a network communications processor architecture
US20060059316A1 (en) Method and apparatus for managing write back cache
US9280297B1 (en) Transactional memory that supports a put with low priority ring command
US8683221B2 (en) Configurable memory encryption with constant pipeline delay in a multi-core processor
US8595401B2 (en) Input output bridging
US9612934B2 (en) Network processor with distributed trace buffers
US20230224261A1 (en) Network interface device
US9195464B2 (en) Tracking written addresses of a shared memory of a multi-core processor
EP4453706A1 (en) Coherent block read fulfillment
HK1195958A (en) Multi-core interconnect in a network processor
HK1195958B (en) Multi-core interconnect in a network processor
HK1195959B (en) Processor with efficient work queuing
HK1195959A (en) Processor with efficient work queuing