
HK1058562A - High-speed data processing using internal processor memory space - Google Patents


Info

Publication number
HK1058562A
HK1058562A (application HK04101206.4A)
Authority
HK
Hong Kong
Prior art keywords
processor
packet
register file
header
data
Prior art date
Application number
HK04101206.4A
Other languages
Chinese (zh)
Inventor
Terrence Hussey
Donald W. Monroe
Arnold N. Sodder
Original Assignee
Tenor Networks, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tenor Networks, Inc. filed Critical Tenor Networks, Inc.
Publication of HK1058562A publication Critical patent/HK1058562A/en


Description

High-speed data processing using internal processor memory space
CROSS-REFERENCE TO RELATED APPLICATIONS
Priority and benefit of U.S. provisional patent application No. 60/186,782, filed in March 2000, are claimed herein, and that application is incorporated herein by reference in its entirety.
Technical Field
The present invention relates generally to information processing, and in particular to processing activities occurring within internal elements of a processor.
Background
Data processing typically involves retrieving data from memory, processing the data, and storing the results of the processing activity back to the memory. The hardware architecture that supports this data processing generally controls the flow of information and control between the individual hardware units of the information handling system. One such hardware unit is a processor or processing engine that includes arithmetic and logic processing circuitry, general or special purpose registers, processor control or sequencing logic, and data paths that interconnect these elements. In some implementations, the processor may be configured as a stand-alone Central Processing Unit (CPU) implemented as a custom designed integrated circuit or within an Application Specific Integrated Circuit (ASIC). The processor has internal registers for use with operations defined by a set of instructions. These instructions are typically stored in an instruction memory and specify a set of hardware functions that are available on the processor.
When performing these functions, a processor typically retrieves "transient" data from a memory external to the processor, loads portions of the data into its internal registers sequentially or randomly by executing a "load" instruction, processes the data as instructed, and then stores the processed data back into the external memory using a "store" instruction. In addition to loading transient data into internal registers and shifting execution results out of internal registers, load and store instructions are also frequently used during the actual processing of transient data to access additional information needed to complete processing activities (e.g., access status and command registers). Frequent load/store accesses to external memory are generally inefficient because the processor's execution capabilities are substantially faster than its external interface capabilities. Thus, the processor is often idle waiting for the accessed data to be loaded into its internal register file.
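The cost asymmetry described above can be illustrated with a simple cycle-count model. All cycle counts here are hypothetical and chosen only to show the shape of the problem: per-operation load/store traffic multiplies the external-access penalty, whereas a data set moved into the register file once is processed at register speed.

```python
# Behavioral sketch (hypothetical cycle counts) of why frequent load/store
# traffic stalls a processor: external-memory accesses cost far more cycles
# than operations on data already in the register file.

REG_OP_CYCLES = 1       # operate on data already in the register file
EXT_ACCESS_CYCLES = 20  # load from / store to external memory (assumed)

def load_store_cost(num_ops: int) -> int:
    """Each operation loads its operand, processes it, and stores the result."""
    return num_ops * (EXT_ACCESS_CYCLES + REG_OP_CYCLES + EXT_ACCESS_CYCLES)

def register_resident_cost(num_ops: int) -> int:
    """Data set is moved in once, processed in registers, moved out once."""
    return EXT_ACCESS_CYCLES + num_ops * REG_OP_CYCLES + EXT_ACCESS_CYCLES

print(load_store_cost(64))         # 2624 cycles
print(register_resident_cost(64))  # 104 cycles
```

With these assumed numbers, a 64-element data set costs roughly 25 times as many cycles when every element makes a round trip to external memory.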
This inefficiency is particularly limiting in devices operating within a communication system, since its net effect is to constrain the overall data processing capability of the device, and ultimately the maximum information rate of the network itself, unless some data is dropped rather than transmitted.
Summary of The Invention
The present invention takes into account that frequent accesses to external memory are not necessary to process data sets that are small enough to be contained within the local register file space allocated to processing the data sets. Thus, the present invention incorporates data access techniques that are executed, at least in part, independently of the processor and which avoid the execution of load and store instructions by the processor.
In one embodiment, an information handling system and method incorporating aspects of the present invention confine the operation of a processor allocated to processing a data set to the processor's internal register file. The information handling system includes a processor, an ingress element, and an egress element. The ingress element receives unprocessed data from an interface to a data source, for example a network interface that receives data from a communication network. The ingress element feeds this unprocessed data, or a portion thereof, into the internal register file space by directly accessing that space. In response to the transfer into the processor's register file, a unit for manipulating data within the processor (e.g., an arithmetic logic unit) operates on and processes the data, with its execution confined entirely to the internal register file space. When the processing activity is complete, the egress element directly accesses the internal register file space and retrieves the processed data. Alternatively, an intermediate state machine directly accesses the processed data and passes it to the egress element.
In one aspect of the invention, one or more state machines are included to manage the operation of the ingress and egress elements. One or more state machines may also be included within the processor. The state machines directly access the internal register file space of the processor to transfer data to it or retrieve data from it. In one embodiment, the data transfer activity of a state machine is initiated in response to a) receipt of unprocessed data at the ingress element, b) a signal indicating to the processor logic that unprocessed data is to be transferred into the register file space of the processor, and/or c) a change in a value stored in a logic element, such as a command register.
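The behavior of such a state machine can be sketched as follows. This is a minimal model, not the claimed hardware: the class names and the trigger (receipt of unprocessed data, case a above) are illustrative, and the burst transfer stands in for the direct register accesses the state machine would perform.

```python
# Minimal sketch of a DRA state machine that moves a received data block
# directly into a processor's register file upon arrival of unprocessed
# data -- no processor load instructions are involved.

class RegisterFile:
    def __init__(self, size: int):
        self.regs = [0] * size

class DraStateMachine:
    """Idle until data arrives; then bursts it into the register file."""
    def __init__(self, register_file: RegisterFile):
        self.rf = register_file
        self.state = "IDLE"

    def on_data_received(self, data: list, base_addr: int = 0) -> None:
        self.state = "TRANSFER"
        for offset, word in enumerate(data):  # direct register writes
            self.rf.regs[base_addr + offset] = word
        self.state = "IDLE"                   # transfer complete

rf = RegisterFile(256)
DraStateMachine(rf).on_data_received([0x45, 0x00, 0x00, 0x3C])
print(rf.regs[:4])  # [69, 0, 0, 60]
```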
The benefits of the present invention can be realized in many information processing systems, such as those systems that focus on image processing, signal processing, video processing, and network packet processing. As an example, the present invention can be embodied within a communication device, such as a router, to implement network services such as routing processing, path determination, and path switching functions. The routing processing function determines the type of routing required for a packet, and the path switching function allows the router to accept a packet on one interface and forward the packet on a second interface. The path determination function selects the most appropriate interface for forwarding the packet.
The path switching functionality of the communication device can be implemented within one or more forwarding engine ASICs incorporating aspects of the present invention to support packet transfers between multiple interfaces of the communication device. In this illustrative embodiment, packet data arriving over the communication network is received by ingress logic associated with a particular input of the communication device's network interface. A processor is then selected by the ingress logic from a pool of candidate processors associated with the receiving port to process the packet.
Once the processor has been allocated, the packet is divided into header and body portions. The header is written, by at least one state machine in the ingress logic configured to use direct memory/register access, to an appropriate location within a memory element, such as the internal register file associated with the allocated processor, without the processor executing load or store instructions. The packet body portion is written to an output buffer. The processor then processes the packet header according to locally stored instructions (again without recourse to load or store instructions) and transfers the processed packet header to a selected output buffer, where it is recombined with the packet body and then transferred to a destination output of the communication device for transmission.
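The header/body split can be sketched in a few lines. The split point corresponds to the header-DMA-length value described later in this document; the constant of 20 bytes used here is purely illustrative.

```python
# Sketch of the header/body split: the first HEADER_DMA_LENGTH bytes go to
# the allocated processor's register file, the remainder to an egress
# output buffer. The 20-byte length is an assumed, per-port configuration.

HEADER_DMA_LENGTH = 20  # bytes of each packet treated as the header (assumed)

def split_packet(packet: bytes, header_len: int = HEADER_DMA_LENGTH):
    """Return (header, body) portions of a received packet."""
    return packet[:header_len], packet[header_len:]

packet = bytes(range(32))
header, body = split_packet(packet)
assert len(header) == 20 and len(body) == 12
```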
Before receiving a packet header, the assigned processor repeatedly executes, in an infinite loop, an instruction stored at a first known location/address in its instruction memory. Hardware in the processor treats address 0 as a "special" address whose instruction is returned by hardwired circuitry rather than from the instruction memory coupled to the processor. When a packet header is transferred from the ingress logic to the processor, a control signal indicates to the processor that a header transfer is in progress. When this signal is asserted, the processor hardware forces the processor's program counter to a specific address (e.g., address 2), which terminates execution of the infinite loop. Upon completion of the transfer of the packet header, the processor begins executing instructions starting at address 2 of its instruction memory. Upon completion of packet processing activities, the processor is reset (e.g., the program counter is set to address 0) so that it again repeatedly executes the instruction at that particular address.
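The wait-loop trigger mechanism just described can be modeled as a small piece of control logic. The addresses (0 and 2) and the "JMP 0" instruction come from the text; the class and method names are hypothetical.

```python
# Sketch of the wait-loop trigger: address 0 is hardwired to a jump-to-0
# instruction, so the idle processor spins there; asserting the header-
# transfer signal forces the program counter to the first real instruction.

class ProcessorControl:
    def __init__(self):
        self.pc = 0           # reset state: spin at address 0
        self.running = False

    def fetch(self) -> str:
        if self.pc == 0:
            return "JMP 0"    # returned by hardwired circuitry, not memory
        return f"imem[{self.pc}]"

    def on_header_transfer(self) -> None:
        self.pc = 2           # break out of the infinite loop
        self.running = True

    def on_processing_done(self) -> None:
        self.pc = 0           # back to the idle loop
        self.running = False

cpu = ProcessorControl()
assert cpu.fetch() == "JMP 0"
cpu.on_header_transfer()
assert cpu.fetch() == "imem[2]"
```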
In this manner, the packet header is written directly into the processor's register file, and the processor requires no interaction or prior knowledge until the packet header is ready to be processed. Other information related to the state or characteristics of the packet (e.g., its length) can also be stored locally within the register file using similar procedures, so that the processor need not access external resources to obtain it.
To simplify the programming model for multiple processors, a separate processor can be assigned to each packet, with each processor configured to execute a common set of instructions from its respective instruction memory. Enough processors are allocated to ensure that packets can be processed at the wire/line rate of the communication network (i.e., the maximum bit rate of the network interface). The reduced instruction set that results from incorporating aspects of the present invention into multiple processors within an ASIC reduces the die area per processor, allowing a greater number of processors within the ASIC without encountering the technological hurdles and yield limits that would otherwise make such an ASIC prohibitively difficult to manufacture. The ASIC implementation of the present invention is also scalable, for example by increasing the clock rate of the processors, by adding more processors to the ASIC, or by aggregating pools of processors (having a common instruction set) from multiple ASICs.
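The line-rate sizing argument above reduces to simple arithmetic: each packet occupies one pooled processor for its full processing time, so the pool must cover the packet arrival rate. All numbers in this example are hypothetical.

```python
# Back-of-envelope sizing (hypothetical numbers) for how many pooled
# processors sustain line rate: pool size >= arrival rate x per-packet
# processing time.

import math

def processors_needed(line_rate_bps: float, packet_bits: int,
                      processing_time_s: float) -> int:
    """Minimum pool size to keep up with worst-case packet arrivals."""
    packets_per_second = line_rate_bps / packet_bits
    return math.ceil(packets_per_second * processing_time_s)

# e.g. 10 Gb/s of 512-bit packets, 1 microsecond of processing each:
print(processors_needed(10e9, 512, 1e-6))  # 20
```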
In one embodiment, the present invention can be used in a Symmetric Multiprocessing (SMP) system exhibiting a Reduced Instruction Set Computer (RISC) architecture to process packets received over a communication network. An SMP system includes multiple identical processors, running common software and operating as a pool, any one of which is suitable for processing a particular packet. Each incoming packet is assigned to an available processor within the pool, and the processors use a common instruction set to process packets in parallel. The SMP system then reconstructs the processed packet stream so that it exhibits the correct packet order.
Brief description of the drawings
The foregoing discussion will be more readily understood from the following detailed description when taken in conjunction with the accompanying drawings, wherein:
FIG. 1 schematically illustrates a communication device coupling a communication network to other networks, such as LANs, MANs, and WANs;
FIG. 2 schematically illustrates several elements of a network interface card installed in the communication device of FIG. 1, in accordance with one embodiment of the present invention;
FIG. 3 schematically illustrates several elements of a forwarding engine that forms part of the network interface card of FIG. 2 in accordance with one embodiment of the present invention;
FIG. 4 provides a flowchart of steps performed when running the forwarding engine of FIG. 3, in accordance with one embodiment of the present invention;
FIG. 5 schematically illustrates several elements of the ingress logic and processor of the forwarding engine of FIG. 3 performing direct memory and direct register access, in accordance with an embodiment of the present invention;
FIG. 6 provides a flowchart of steps performed during operation of the entry logic and processor of FIG. 5, in accordance with one embodiment of the present invention;
FIG. 7 schematically illustrates a more detailed set of components making up the processor of FIG. 5, in accordance with one embodiment of the present invention; and
FIG. 8 provides a flowchart of steps performed when executing the processor element depicted in FIG. 7, in accordance with one embodiment of the present invention.
Detailed description of the invention
A typical microprocessor executes load and store instructions to bring a transient image of data, representing a data structure stored in a storage element external to the processor, into the processor's local register file for further execution. As used herein, the term "local register file" means the totality of registers within the internal structure of the processor that are available for operating on data. "Register" refers to a distinct set of storage elements, such as D flip-flops. Depending on the processor design, the register file space can be made up of a combination of memory and flip-flops. In any event, the register file is typically implemented using a high-speed memory element that provides multiple independently accessible read and write ports. During execution of a software program, a typical processor executes a relatively large number of load/store instructions to move data from external memory to the local register file and to move execution results from the local register file to external memory. These frequent accesses to external memory are forced upon the processor because the data sets to be processed are too large to fit into the execution space of the local register file.
The present invention recognizes that frequent accesses to external memory are not necessary to process data sets that are small enough (e.g., 128 to 512 eight-bit data elements) to fit entirely within the local register file space. As described in detail below, the present invention combines Direct Memory Access (DMA) and Direct Register Access (DRA) techniques to move data and execution results into and out of a processor's register file without requiring the processor to execute data-movement instructions, such as load and store instructions. DMA herein refers to the use of one or more state machines to move blocks of data into or out of an internal or external memory independently of the processor. Similarly, DRA refers to a special type of DMA in which one or more blocks of data are moved into or out of a processor's register file space independently of the processor. In one embodiment, the register file is implemented as a five-port register file space with two write ports and three read ports (as opposed to a standard three-port register file space with one write port and two read ports) to facilitate direct register file access. This approach avoids relatively slow (compared to operations within the register file) accesses to external memory, avoids memory wait states, and reduces the size of the processor instruction set. Thus, in addition to greatly increasing the performance of an individual processor, the die size and power consumption of Application Specific Integrated Circuits (ASICs) containing these processors can be reduced, and the total number of processors within the ASICs can be greatly increased without incurring prohibitive costs.
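A toy model of the five-port register file mentioned above helps show why the extra ports matter: an external DRA write and an internal ALU result can land in the same cycle that operands are read. The port counts come from the text; the write-before-read behavior within a cycle is a simplification of real hardware timing.

```python
# Toy model of a five-port register file: two write ports and three read
# ports usable in the same cycle. Port counts are from the text; the
# in-cycle write-through behavior is an illustrative simplification.

class FivePortRegisterFile:
    WRITE_PORTS, READ_PORTS = 2, 3

    def __init__(self, size: int = 256):
        self.regs = [0] * size

    def cycle(self, writes, reads):
        """writes: up to 2 (addr, value) pairs; reads: up to 3 addrs."""
        assert len(writes) <= self.WRITE_PORTS
        assert len(reads) <= self.READ_PORTS
        for addr, value in writes:  # e.g. one ALU result + one DRA write
            self.regs[addr] = value
        return [self.regs[addr] for addr in reads]

rf = FivePortRegisterFile()
out = rf.cycle(writes=[(0, 7), (1, 9)], reads=[0, 1, 2])
assert out == [7, 9, 0]
```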
Although the present invention will be described hereinafter as being implemented in a network interface card of a communication device for the purpose of processing packets received over a network, this particular implementation is merely an illustrative embodiment and those skilled in the art will recognize any number of other embodiments and applications that would benefit from the claimed invention. By way of example and not limitation, the present invention can be beneficial to information processing applications that contain relatively small data sets, such as those present in image processing, signal processing, and video processing. The present invention can also be implemented in a wide variety of network communication devices (e.g., switches and routers) and other information processing environments.
Referring to fig. 1, a communication device 150 receives information (e.g., in the form of packets/frames, cells, or TDM frames) from a communication network 110 via a communication link 112 and transmits the received information to a different communication network or branch such as a Local Area Network (LAN)120, a Metropolitan Area Network (MAN)130, or a Wide Area Network (WAN)140 or to a local attached end station (not shown). The communication device 150 can include a plurality of Network Interface Cards (NICs), such as NIC160 and NIC180, each having a series of inputs (e.g., 162, 164, and 166) and outputs (e.g., 168, 170, and 172). The inputs 162, 164 and 166 receive information from the communication network 110 and pass them to a plurality of packet processing engines (not shown) that process the packets and prepare them for transmission at one of the outputs 168, 170 and 172, the outputs 168, 170 and 172 corresponding to the communication network containing the end station, such as the LAN120, the MAN130 or the WAN 140.
Referring to fig. 2, a Network Interface Card (NIC)160 embodying aspects of the present invention includes inputs 162, 164, 166, a packet processing or forwarding engine 220, an Address Lookup Engine (ALE)210, a statistics module 230, a queuing/dequeuing module 240, and outputs 168, 170, 172. NIC160 receives data from packet-based communication network 110 (fig. 1) at inputs 162, 164, 166. Forwarding engine 220, in conjunction with ALE210, determines the destination output of packets by looking up the appropriate output 168, 170, 172 associated with the destination and prepends the forwarding vector to the packet to assist in routing them to the appropriate output.
The altered packets are passed to the enqueue/dequeue module 240, where the forwarding vector is used to organize the packets into queues associated with particular destination outputs 168, 170, 172. The forwarding vector for each packet is then removed, and the packet is scheduled for transmission to the selected output 168, 170, 172. The packets are then transmitted from the selected output 168, 170, 172 to a communication network such as the LAN120, the MAN130, or the WAN 140. In one embodiment, the queuing/dequeuing module 240 of the NIC160 receives the altered packets via a full-mesh interconnect (not shown), so it can aggregate packets originally received at the inputs of any NIC160, 180 installed within the communication device 150, including packets received at the inputs 162, 164, 166 of its own NIC160, onto one or more of the outputs 168, 170, 172 of its own NIC160. In another embodiment, packets received at the inputs 162, 164, 166 are passed directly by the forwarding engine 220 to the enqueue/dequeue module 240.
Referring to figs. 3 and 4, an illustrative embodiment of the architecture of forwarding engine 220 includes ingress logic 310, ALE interface 350, statistics interface 360, egress logic 370, and one or more processors representatively shown at 320, 330, 340. In operation, data corresponding to a packet is transmitted over the communication network 110 and received at a particular one of the inputs 162, 164, or 166 of the NIC160 or 180 coupled to the communication network 110 (step 410). Processor 330 is then selected from a pool of processors (represented at 320, 330, 340) associated with the input 162, 164, or 166 (step 420). Once the processor 330 has been allocated, the packet is divided into header and body portions by ingress logic 310 (step 430). The header is written to a particular location within register file 710 (fig. 7) associated with processor 330 using direct register access, and the body is written to an output buffer in egress logic 370 using direct memory access (step 440). Processor 330 then processes the packet header according to locally stored instructions (step 450) and passes the processed packet header to egress logic 370, where it is recombined with the body of the packet (step 460).
In processing the packet header, processor 330 may perform tasks such as checking the integrity of the packet header, verifying its checksum, accessing statistics module 230 via statistics interface 360 to report statistics on processing activities involving the packet header to modules external to forwarding engine 220, and communicating with ALE210 via ALE interface 350 to obtain routing information for the one of outputs 168, 170, 172 associated with the destination of the packet. Additional network-specific (e.g., IP, ATM, Frame Relay, HDLC, TDM) packet processing may be performed at this point. At the end of this processing activity, the processor 330 modifies the packet header to include routing information that specifies a particular output 168, 170, 172 of the NIC160 (e.g., by prepending the forwarding vector to the packet header). The modified header is then written to egress logic 370 of forwarding engine 220, from which it is routed to the queuing/dequeuing module 240 as described above.
ALE interface 350, statistics interface 360, and egress logic 370 are resources within forwarding engine 220 that may be shared among processors 320, 330, 340. An arbitration means (not shown) is provided at the forwarding engine 220 to arbitrate between the processors 320, 330, 340 accessing these resources 350, 360, 370. In one embodiment, when a processor 330 is assigned to a packet, a processor identifier, such as a processor number, for the processor 330 is passed to each of the three shared resources 350, 360, 370 identified above. Each of these shared resources 350, 360, 370 then writes the processor number into a FIFO, preferably having a depth equal to the total number of processors in the forwarding engine 220. Logic in each shared resource 350, 360, 370 accesses its respective FIFO to determine which processor 320, 330 or 340 should be the next processor to grant access to the resource. Once the granted processor completes its access to a particular resource 350, 360, 370, the accessed resource reads its next FIFO entry to determine the next processor to issue a grant to.
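The per-resource FIFO arbitration just described can be sketched directly: when a processor is assigned a packet, its number is queued at each shared resource, and each resource grants access strictly in that arrival order. The class and method names here are hypothetical.

```python
# Sketch of FIFO-based arbitration for a shared resource: processor numbers
# are queued in packet-assignment order; the resource grants access to one
# processor at a time, in that order.

from collections import deque

class SharedResource:
    def __init__(self, num_processors: int):
        # FIFO depth equal to the total number of processors, as in the text
        self.grant_fifo = deque(maxlen=num_processors)

    def on_packet_assigned(self, processor_id: int) -> None:
        self.grant_fifo.append(processor_id)

    def next_grant(self) -> int:
        # called after the previously granted processor finishes its access
        return self.grant_fifo.popleft()

ale_interface = SharedResource(num_processors=16)
for pid in (3, 0, 7):
    ale_interface.on_packet_assigned(pid)
assert ale_interface.next_grant() == 3
assert ale_interface.next_grant() == 0
```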
More specifically and with reference to figs. 5 and 6, the receipt, manipulation, and transfer of packet data within forwarding engine 220 are handled primarily by a plurality of DMA and DRA state machines. In the illustrative embodiment, these state machines are included within ingress logic 310 and processor 330. During operation of the illustrative embodiment, a packet is received from one of the inputs 162, 164, 166 of the NIC160 and stored in a receive-data FIFO (first-in/first-out buffer) 510 within the ingress logic 310 (step 610). The receive-status FIFO512 records the particular input 162, 164, or 166 on which the packet arrived and maintains an ordered list of the input numbers for each packet received by the forwarding engine 220, sorted by the time the packet was received.
The issue-DMA-command state machine 514 detects when the receive-status FIFO512 contains data and retrieves from it the input number (162, 164, or 166) associated with the received packet (step 620). The issue-DMA-command state machine 514 then sends a processor-assignment request containing the packet's port number to the allocate-processor state machine 516, which accesses the allocate-combine register 518 associated with that port number to determine the set of processors 320, 330, 340 that are candidates for processing the packet (step 630). The allocate-processor state machine 516 then accesses the processor-free register 520 to determine whether any of the candidate processors 320, 330, 340 identified by the allocate-combine register 518 is available. The allocate-processor state machine 516 then allocates an available processor 330 from the set of candidate processors 320, 330, 340 to process the packet (step 640) and sends an allocation grant and the processor number of the processor 330 to the issue-DMA-command state machine 514.
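One natural realization of this allocation step, sketched under the assumption that both registers are bitmasks (the text does not specify their encoding): the per-port combine register marks candidate processors, the processor-free register marks idle ones, and any processor set in both may be allocated.

```python
# Sketch of processor allocation, assuming bitmask register encodings:
# candidates_mask = per-port allocate-combine register value,
# free_mask      = processor-free register value.

def allocate_processor(candidates_mask: int, free_mask: int):
    """Return the number of a free candidate processor, or None."""
    eligible = candidates_mask & free_mask
    if eligible == 0:
        return None  # no candidate processor is currently free
    # pick the lowest-numbered eligible processor (isolate lowest set bit)
    return (eligible & -eligible).bit_length() - 1

# port's candidates = processors 0-3; processors 0 and 2 are busy
assert allocate_processor(0b1111, 0b1010) == 1
```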
Upon receipt of the processor number of the allocated processor 330, the issue-DMA-command state machine 514 sends an execute signal/command containing the processor number to the DMA-execution state machine 522, which accesses the header-DMA-length register 524 to obtain the number of bytes of the received packet (i.e., the length of the header) to be sent to the processor 330 (step 650). The DMA-execution state machine 522 then issues a DMA command that retrieves the header portion (corresponding to the packet header) from the receive-data FIFO510 and transfers it on the DRA bus 526, where it is received by the processor-DRA state machine 530 contained within processor 330 (step 660). The DMA-execution state machine 522 also issues a command to retrieve the body from the receive-data FIFO510 and transfer it on another DMA bus 528 for receipt by a buffer (not shown) in the egress logic 370. The processor-DRA state machine 530 then writes the header data received via the DRA bus 526 directly to a register file region beginning at a fixed address location (e.g., address 0) within the register file space 710 (fig. 7) of processor 330 (step 670). The processor 330 then processes the packet header (step 680) and transfers the processed packet header to the egress logic 370 via the transmit-DMA state machine 532 for reassembly with the packet body (step 690).
More particularly and with reference to figs. 7 and 8, processing of packet headers within the processor 330 is preferably confined so that the processor's instructions and activities are limited to operating on data and producing execution results within the execution space formed by the processor's local register file 710. The architecture of processor 330 in the illustrative embodiment includes the Stats-interface state machine 704, the ALE-interface state machine 706, the processor-DRA state machine 530, the transmit-DMA state machine 532, the register file 710, an Arithmetic Logic Unit (ALU)720, a processor control module 730, and an instruction memory 740. The computational unit 725 is composed of the processor control module 730 and the ALU 720.
During operation of the illustrative embodiment, while the processor 330 waits to receive a packet header, the computational unit 725 continuously executes the instruction at a particular address (e.g., address 0) of the instruction memory 740 (i.e., in an infinite loop) (step 810). Hardware within processor 330 treats address 0 as a particular address whose instruction is returned from a hardwired "circuit" instruction value etched in silicon rather than from the instructions stored within instruction memory 740. In one possible implementation, an access to the instruction at the particular address 0 returns "JMP 0" (i.e., a jump-to-address-0 instruction), causing the processor 330 to execute an infinite loop at that address.
When a packet header is transferred from ingress logic 310 to the processor's register file 710, a control signal from the processor-DRA state machine 530 indicates to the processor control module 730 that a packet header transfer is in progress (step 820). When this signal is asserted, the processor control module 730 forces the processor program counter (not shown) to a specific address (e.g., address 2) of the instruction memory 740, thereby causing the computational unit 725 to jump out of the infinite loop being executed at address 0 and wait until the signal becomes inactive (step 830). The computational unit 725 begins executing the instruction at address 2 in response to the signal becoming inactive (step 840). Address 2 of instruction memory 740 can be configured to hold the first instruction used to process the packet header within register file 710 (i.e., the instruction at address 2 corresponds to the beginning of the "real" software image previously downloaded to process packet headers). When the processor-DRA state machine 530 completes writing a packet header to a fixed location within register file 710 (which is when the control signal becomes inactive), the computational unit 725 continues executing the remaining instructions within instruction memory 740 normally (i.e., beyond address 2). Particular instructions within instruction memory 740 specify locations within register file 710. When the processing activity for a particular packet header is completed, the executing software "jumps" to address 0, thereby resuming the infinite loop of the instruction at address 0. This technique illustrates one particular way in which the processor 330 may be triggered to process the packet header stored in the register file 710 without the use of load and store instructions.
In another embodiment, the allocated processor 330 remains idle (i.e., does not access instruction memory or execute instructions) until it receives a signal from an external state machine indicating that the register file 710 has been filled with a complete packet header. The computational unit 725 then executes code from the instruction memory 740 to process the packet header. The triggering event can, for example, be the control signal becoming inactive. Alternatively, the allocated processor 330 is triggered when the DRA transfer has been initiated, is ongoing, or has been completed. Many other triggering events will be apparent to those skilled in the art.
As discussed earlier, the processor 330 accesses one or more shared resources external to the processor 330 during processing of the packet header (see, e.g., fig. 3, ALE interface 350, statistics interface 360, and egress logic 370). For example, processor 330 interacts with ALE210 (fig. 2) via ALE interface 350 (fig. 3) to issue searches of ALE210 and receive search results therefrom. These interactions with ALE210 performed by processor 330 may occur without processor 330 executing load and store instructions.
In one aspect, when executing instructions in instruction memory 740, processor 330 composes a search key starting at a predetermined address in register file 710. The computational unit 725 executes an instruction that writes a value to an ALE-command register to specify the amount of search-key data to send to ALE 210. This value effectively serves as a control signal for the ALE-interface state machine 706 of processor 330 and thereby triggers the ALE-interface state machine 706 to read this value and other data from the ALE-command register, determine the amount of data to be transferred, and transfer the specified data to ALE interface 350 using a direct memory access independent of the computational unit 725. While waiting for the search results to be returned, the processor 330 can perform other functions, such as verifying the network-protocol (e.g., IP) checksum of the header. When the search results from ALE210 are valid, they are sent to the ALE-interface state machine 706 via ALE interface 350. The ALE-interface state machine 706 writes the search results to a predetermined location of register file 710 using one or more direct register accesses and signals the computational unit 725 when the write is complete. The computational unit 725 then modifies the packet header in response to the search results.
The processor 330 can also issue a statistics-update command by writing an address and a length value to a statistics-update-command register (not shown) of the processor 330. This write triggers the statistics-interface state machine 704 of the processor 330 to read the data from the statistics-update-command register, determine the source and amount of data to transfer, and transfer the specified data to the statistics interface 360 using a direct memory access independent of the computational unit 725.
Similarly, when the processor 330 has completed processing the packet header, the computational unit 725 writes the processed packet header to the transmit-DMA state machine 532 of the processor 330, which transfers the processed packet header to a buffer within the egress logic 370 using a direct memory access independent of the computational unit 725 (step 850). When all processing is complete, software executing within the processor 330 jumps back to address 0 of the instruction memory 740 and resumes executing the infinite-loop instruction previously discussed while waiting for the next packet header to arrive (step 860).
More specifically, at the completion of the processing activity, the processed packet header may not reside in contiguous regions of the register file 710, and thus the computational unit 725 may have to specify the location of each fragment of the processed packet header within the register file 710. Accordingly, the computational unit 725 issues one or more writes to a move-DMA-command register (not shown) specifying the starting address and length of each header fragment. These writes are stored in a FIFO, essentially forming a list of gather commands. After the data for all of the header fragments has been specified, the computational unit 725 writes the length of the packet body, along with other data, to a send-DMA-command register (not shown).
The value written to the send-DMA-command register triggers the transmit-DMA state machine 532 within the processor 330 to begin assembling the packet header according to the gather commands stored in the FIFO described above. The transmit-DMA state machine 532 then sends the assembled packet header, together with some control information (including the packet length), to the egress logic 370 using a direct memory access independent of the computational unit 725. The egress logic 370 concatenates the processed packet header received from the transmit-DMA state machine 532 with the packet body stored in the FIFO of the egress logic 370 and then sends the reconstructed packet to the enqueue/dequeue module 240 as previously described.
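The gather step performed by the transmit-DMA state machine can be sketched as draining a FIFO of (start, length) commands against the register file and concatenating the fragments in order. The names `register_file` and `gather_fifo` below are illustrative assumptions, not identifiers from this disclosure:

```python
from collections import deque

def assemble_header(register_file: bytes, gather_fifo: deque) -> bytes:
    """Concatenate header fragments in FIFO order, mimicking the
    transmit-DMA state machine draining its gather-command FIFO."""
    out = bytearray()
    while gather_fifo:
        start, length = gather_fifo.popleft()
        out += register_file[start:start + length]
    return bytes(out)

# Processed header fragments left at non-contiguous offsets 0 and 32
# of a 64-byte register file, described by two gather commands.
rf = bytearray(64)
rf[0:4] = b"ABCD"
rf[32:36] = b"EFGH"
fifo = deque([(0, 4), (32, 4)])
header = assemble_header(bytes(rf), fifo)
```

Queuing the (start, length) pairs as they are written lets the computational unit describe a scattered header with a handful of register writes and leave the byte movement to the state machine.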
To properly reconstruct the packet, the processor 330 obtains the length of the entire packet from data embedded in the packet header itself and obtains the length of the packet header (corresponding to the value written into the header-length register 524 of fig. 5) from the data transferred to the processor 330 through the receive-data FIFO 510 (fig. 5). From this information, the processor 330 calculates the amount of packet data that was previously transferred to the output FIFO within the egress logic 370 and specifies that length as control information sent to the egress logic 370 via the transmit-DMA state machine 532. In this manner, the processor 330 can specify the amount of packet data to be pulled from the output FIFO of the egress logic 370 and appended to the newly assembled packet header formed by the processor 330 to reconstruct the modified packet. To properly reconstruct the modified packets, the processors 330 are permitted to access the egress logic 370 in the same order in which they were assigned packets (and thus in the same order in which the packet bodies were written to the output FIFO of the egress logic 370).
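The length bookkeeping described above reduces to simple arithmetic: the number of body bytes already resident in the output FIFO is the total packet length (read from the header) minus the header length (the value written to the header-length register 524). A minimal numeric sketch, with illustrative names:

```python
def body_length(total_packet_len: int, header_len: int) -> int:
    """Bytes of the packet previously streamed to the egress output FIFO,
    i.e., everything past the header that the ingress logic split off."""
    if header_len > total_packet_len:
        raise ValueError("header longer than packet")
    return total_packet_len - header_len

# Example: a 1500-byte packet whose first 64 bytes were peeled off as
# the header; 1436 body bytes must be pulled from the output FIFO and
# appended to the reassembled header.
pulled = body_length(1500, 64)
```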
Aspects of the present invention provide great flexibility in allocating computing resources to meet input packet-processing requirements. Assuming for illustrative purposes a total of 40 processors 320, 330, 340 within the forwarding engine 220, the processors 320, 330, 340 can be flexibly allocated to meet the packet-processing needs of numerous input/output architectures. For example, in a NIC 160 with only a single logical input (i.e., port 0), all 40 processors 320, 330, 340 can be assigned to process packets for that single port. In this case, the code image loaded into the instruction memory 740 of each processor 320, 330, 340 would be identical, so that each processor 320, 330, 340 can perform the same algorithm for that type of input. In another scenario involving four logical inputs, each with a different type of network interface, the processing algorithms required for the various network interfaces may differ. In this case, the 40 processors can be allocated as follows: processors [0-9] for port 0, processors [10-19] for port 1, processors [20-29] for port 2, and processors [30-39] for port 3. Four different code images can then be downloaded, with each image corresponding to a particular input. In yet another case, the NIC 160 may include two logical inputs, each with different processing-performance requirements. For example, one input may consume 75% of the ingress bus bandwidth and have a packet arrival rate that requires 75% of the processor resources, while the second input consumes the remainder. To support these requirements, 30 processors can be assigned to input 0 and 10 processors to input 1.
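The allocation examples above amount to a proportional split of a processor pool across inputs. The following sketch, with assumed function and variable names, reproduces the three scenarios:

```python
def allocate_processors(total: int, shares: list) -> list:
    """Split `total` processors across inputs in proportion to `shares`
    (fractions summing to 1), handing any leftover processors to the
    inputs with the largest fractional remainders."""
    counts = [int(total * s) for s in shares]
    remainder = total - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: total * shares[i] - counts[i],
                   reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return counts

# The scenarios from the text: one input taking everything; four equal
# inputs; and a 75%/25% split of 40 processors.
one = allocate_processors(40, [1.0])
four = allocate_processors(40, [0.25, 0.25, 0.25, 0.25])
split = allocate_processors(40, [0.75, 0.25])
```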
The programming model for NICs 160, 180 that include multiple processors as part of their forwarding engines 220 can be simplified by assigning a single processor to each received packet. In addition, and as described above, the reduced die size achieved by incorporating the system of the present invention allows additional processors to be included within the forwarding-engine ASICs of NICs 160, 180, thereby helping to ensure that packets can be transmitted at the line speed of the network 110. The present invention is readily scalable by adding more processors to a given forwarding-engine ASIC, increasing the clock rate of the processors, or aggregating the processor complexes of multiple ASICs. Note that in providing this capability, the hardware architecture of the present invention maintains the order of packets arriving via the network interface so that the reassembled packets can be sent out of the forwarding engine in the proper order.
The processor-complex aggregation technique may be particularly beneficial when the NIC 160 of the communication device 150 receives packet data streams via the communication network 110 at line rates that might otherwise overwhelm the processing power of the NIC 160, resulting in dropped packets and reduced quality of service. The aggregation technique allows idle processors to be allocated from more than one forwarding engine. For example, the NIC 160 may contain multiple forwarding-engine ASICs, each having a processor complex that can be assigned to process packets arriving at any input on the NIC 160. Alternatively, processor complexes present on other NICs 180 within the communication device 150, beyond the forwarding-engine ASICs of the NIC 160, can be assigned to a NIC 160 experiencing a heavy network load.
Although the present invention has been described with reference to specific details, it is not intended that such details be regarded as limitations upon the scope of the invention, except as and to the extent that they are included in the accompanying claims.

Claims (29)

1. A method of processing packets, the method comprising the steps of:
receiving a packet;
identifying a header portion of a data packet;
transmitting the header to a register file accessible by a processor; and
processing the packet header with the processor without invoking at least one of a load instruction and a store instruction.
2. The method of claim 1, wherein the transferring step is performed without invoking at least one of a load instruction and a store instruction.
3. The method of claim 1, further comprising the step of:
dividing the packet into a packet header portion and a packet body portion;
transferring the header to the register file using direct register access; and
transferring the packet body to an output buffer.
4. The method of claim 3, further comprising the step of:
selecting an output for transmission of the packet;
combining the processed packet header with the packet body in the output buffer; and
forwarding the combined packet from the output buffer to the selected output for transmission therefrom.
5. The method of claim 1, further comprising the step of:
providing a plurality of identical processors executing a common instruction set, each processor locally storing the instruction set to the processor;
selecting one processor from a plurality of processors to process the packet header; and
causing the selected processor to process the packet header.
6. The method of claim 5, wherein the step of selecting the processor is performed by a state machine responsive to receipt of a packet on the input.
7. The method of claim 5, wherein the step of causing the selected processor to process the packet header is performed by at least one state machine configured to write the packet header to at least one fixed location within a register file accessible to the selected processor.
8. The method of claim 5, further comprising the step of downloading the common instruction set to an instruction memory within each of the plurality of processors.
9. A method of processing a header of a packet received over a communications network, the method comprising the steps of:
transferring the header to at least one fixed location within a register file;
providing a processor associated with the register file, the processor repeatedly executing an instruction in an infinite loop, the instruction being stored at a first known location in an instruction memory associated with the processor;
causing the processor to execute instructions beginning from a second known location within the instruction memory in response to the transmission of the packet header;
processing the packet header in at least one fixed location within the register file according to an instruction starting from a second known location within the instruction memory; and
resetting the processor upon completion of processing of the packet header to resume repeatedly executing the instruction stored at the first known location within the instruction memory.
10. The method of claim 9, wherein the processing step comprises processing the packet header without invoking at least one of a load instruction and a store instruction.
11. The method of claim 9, further comprising the step of:
receiving packets at an input coupled to the communication network;
selecting a processor from a plurality of candidate processors associated with the input;
dividing the packet into a packet header and a packet body; and
transmitting the header to at least one fixed location within a register file associated with the selected processor by executing a DRA command issued by a state machine coupled to the register file.
12. The method of claim 11, further comprising the step of: a common instruction set is downloaded to an instruction memory within each of a plurality of candidate processors.
13. A packet processing system for processing packets received over a communications network, the system comprising:
an input configured to receive packets over a communications network;
a processor associated with the input;
a register file accessible by the processor; and
an ingress element coupled to the input, the processor, and the register file, the ingress element configured to transfer at least a portion of the packet to the register file by issuing a DRA command,
wherein the processor processes the at least a portion of the packet in the register file in response to the DRA command and without invoking at least one of a load instruction and a store instruction.
14. The packet processing system of claim 13, wherein the ingress element is configured to select a processor from a plurality of candidate processors associated with the input.
15. The packet processing system of claim 14, further comprising a plurality of instruction memories, each of the plurality of instruction memories associated with a corresponding one of the plurality of candidate processors, wherein the plurality of instruction memories contain a same instruction set.
16. The packet processing system of claim 13, wherein the at least a portion of the packet corresponds to a packet header.
17. The packet processing system of claim 16, wherein the ingress element comprises a state machine configured to write the header to a fixed location within the register file.
18. A packet processing system for processing headers of packets received over a communications network, the system comprising:
an input coupled to the communication network;
an ingress element coupled to the input and configured to receive and analyze packets to obtain a packet header;
a register file coupled to the ingress element and configured to store a packet header received from the ingress element in at least one fixed location;
an instruction memory configured to return instructions from at least a first and second address; and
a processor coupled to the ingress element, the register file, and the instruction memory, the processor repeatedly executing an instruction stored at the first address of the instruction memory, wherein the processor executes instructions starting at the second address of the instruction memory to process the packet header in the register file in response to a signal from the ingress element.
19. An information processing system comprising:
a processor having an internal register file space and a unit for manipulating data;
an ingress element for transferring unprocessed data into the internal register file space; and
an egress element for retrieving processed data from the internal register file space,
wherein the operation of the processor is limited to operating on data within the internal register file space.
20. The system of claim 19, further comprising at least one state machine that manages operation of the ingress and egress elements, and responsive to instructions within the internal register file space, the state machine moves data into or out of the internal register file space using direct access to the internal register file space in accordance with the instructions.
21. The system of claim 20, further comprising a network interface that receives data from a communication network, the interface providing the received data to the ingress element.
22. An information processing method, the method comprising the steps of:
providing a processor having an internal register file space and a unit for manipulating data; and
using direct access to the internal register file space, transferring unprocessed data into the internal register file space and retrieving processed data from the internal register file space, the operation of the processor being restricted to manipulating data within the internal register file space.
23. The method of claim 22, further comprising the step of:
providing at least one state machine to manage data transfers into and out of the internal register file space using direct access to the internal register file space; and
causing the processor to signal the state machine by writing a value to a control register, the state machine performing the direct access in response to the value and in accordance with its state machine logic.
24. The method of claim 22, wherein the unprocessed data originates from a communication network having a line data rate, the processor processing data at a rate equal to the line rate.
25. The method of claim 24, wherein the unprocessed data is in packet format.
26. A method of processing a packet stream containing a temporal sequence of packets, the method comprising the steps of:
providing a plurality of identical processors executing a common instruction set, each processor locally storing the instruction set to the processor;
receiving a packet;
for each packet, (i) identifying a header portion of the data packet, (ii) selecting a processor from among the plurality of processors to process the header based on the availability of the processor, and (iii) causing the selected processor to process the header using locally stored instructions; and
combining the processed packets according to the temporal sequence to reconstruct the packet stream.
27. The method of claim 26, wherein the plurality of processors are physically located on a plurality of integrated circuits.
28. A system for processing a packet stream containing a temporal sequence of packets, the system comprising:
a plurality of identical processors executing a common instruction set, each processor including a local instruction memory containing the instruction set;
an input for receiving packets;
an ingress logic unit coupled to the input and the processor, the ingress logic unit configured to, for each packet, (i) identify a header portion of the data packet and (ii) select a processor from among the plurality of processors to process the header based on the availability of the processor, the selected processor processing the header using locally stored instructions in response to the ingress logic unit; and
an egress logic unit for combining the processed packets according to the temporal sequence to reconstruct the packet stream.
29. The system of claim 28, wherein the plurality of processors are physically disposed on a plurality of integrated circuits.
HK04101206.4A 2000-03-03 2001-03-02 High-speed data processing using internal processor memory space HK1058562A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US60/186,782 2000-03-03

Publications (1)

Publication Number Publication Date
HK1058562A true HK1058562A (en) 2004-05-21
