WO2017028909A1

WO2017028909A1 - Shared physical registers and mapping table for architectural registers of multiple threads

Info

Publication number: WO2017028909A1
Application number: PCT/EP2015/068977
Authority: WO
Inventors: Simcha Gochman; Zuguang WU; Weiguang CAI
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2015-08-18
Filing date: 2015-08-18
Publication date: 2017-02-23
Anticipated expiration: 2018-02-18
Also published as: CN107851006B; CN107851006A

Abstract

A system for handling a register accessing request, comprising includes an interface for receiving register accessing requests and a processing unit connected to the interface. The processing unit dynamically maps architectural registers to physical registers based on a criterion such as recent usage and/or access frequency of the architectural registers by multithreading (MT) threads. The processing unit also looks up a respective architectural register for register accessing requests for which a match is not found in the physical registers.

Description

REGISTER MAPPING FOR MULTI-THREADING

BACKGROUND The present invention, in some embodiments thereof, relates to

implementation of multi-threading and, more specifically, but not exclusively, to architectural register management in multi-threading cores.

CPU cores, especially those that are targeted to server market segments, increasingly support multi-threading (MT). The demand for multi-threaded cores has been increasing at a high rate in all server market segments, especially in the context of Scale-out applications (e.g. Big Data).

There are currently three MT implementation schemes:

1. Fine Grain MT (FGMT) - Threads are interleaved on a clock by clock basis;

2. Simultaneous MT (SMT) - Threads run simultaneously sharing all machine resources; and

3. Coarse Grain MT (CGMT, also denoted Switch on Event MT and SoE MT) - A thread runs until it is blocked by an event (that typically results in a long latency stall). Then it is replaced by the next thread that waits in the raw.

Current MT implementations include:

1. Larrabee by Intel (4 way FGMT) ;

2. Xeon Servers by Intel (2 way SMT); and

3. Intel's Itanium Montecito (2 way CGMT).

In MT, each thread carries over the entire architectural state of the machine. Each Architectural Register File Set (ARF) typically includes:

1. Integer Register File (e.g. ARMv8 employs 31 Registers, each 64-bits wide);

2. Floating Point / SIMD Register File (e.g. ARMv8 employs 32 Registers, each 128-bits wide); and

3. Status Register (e.g. ARMv8 employs roughly 6 Registers, each 64- bits wide).

Supporting multiple threads on same die multiplies this amount. The registers must be available and easily accessed. Current MT implementations use the following strategies for handling register files (RFs):

1) Duplicating the ARF for each thread. This is used for FGMT and SMT and in some cases also for CGMT (to avoid long switch times). Duplicating ARFs is very wasteful in terms of silicon area and energy consumption.

2) Holding a single register file set and copying back and forth

(applicable only to CGMT). This approach is time consuming, makes the switch time fairly long and inefficient and severely reduces performance.

SUMMARY

An object of the current invention is to improve multi-threading.

This object is obtained by the subject matter of the independent claims. The dependent claims protect further embodiments.

Embodiments presented herein perform map recently and/or frequently used registers of running threads (i.e. active threads) to physical registers. Registers of all the threads are saved in architectural registers, optionally in a SRAM. When a requested register is not mapped to a physical register, the content of the architectural register is stored in an allocated physical register, possibly replacing previously stored content (e.g. from a suspended thread). In this way, silicon area and energy

consumption are reduced and switch time may be shortened.

According to a first aspect of some embodiments of the present invention there is provided system for handling a register accessing request. The system includes an interface which receives register accessing requests and a processing unit. The processing unit dynamically maps a group of registers from multiple architectural registers to at least one of a multiplicity of physical registers based on at least one of recent usage and access frequency of each one of the architectural registers by multiple multithreading (MT) threads, and looks up a match for each one of the register accessing requests in the architectural registers when the match is not found in the physical registers.

In a first possible implementation form of the system according to the first aspect the MT threads submit the register accessing requests and are of a multithreading processor. In a second possible implementation form of the system, the register accessing requests are received via at least one pipeline engine.

In a third possible implementation form of the system, the architectural registers are stored in a static random access memory (SRAM).

In a fourth possible implementation form of the system, the system further includes a memory for storing an access frequency dataset. The processing unit updates the access frequency dataset with a frequency of access to respective registers and performs the mapping according to the access frequency dataset.

In a fifth possible implementation form of the system, the system further includes a memory for storing a recent usage dataset. The processing unit updates the recent usage dataset with the recent usage and performs the mapping according to the recent usage dataset.

In a second possible implementation form of the system according to the fifth implementation form of the first aspect, the recent usage dataset includes multiple records. Each of the records documents a recent usage of each one of the MT threads to the architectural registers.

In a third possible implementation form of the system according to the fifth implementation form of the first aspect, the recent usage dataset includes respective allocation states of architectural registers.

In a fourth possible implementation form of the system according to the fifth implementation form of the first aspect, the architectural registers maps to an allocation of suspended and running threads of the multiple MT threads and the physical registers maps to an allocation of running threads of the multiple MT threads.

In a fifth possible implementation form of the system according to the fifth implementation form of the first aspect, the processing unit updates the recent usage dataset when switching an allocation of any of the physical registers from one of the MT threads to another of the MT threads.

In a sixth possible implementation form of the system, the processing unit maps the architectural registers to the MT threads.

In a seventh possible implementation form of the system, the processing unit switches mapping of any of the architectural registers from one of the MT threads to another of the architectural registers. In an eighth possible implementation form of the system, the processing unit sets a respective state of physical registers mapped to an active thread to available when the active thread is inactivated by a switch to a different thread.

According to a second aspect of some embodiments of the present invention there is provided a method for handling a register accessing request. The method includes:

i) receiving multiple register accessing requests;

ii) mapping dynamically a group of registers from multiple architectural registers to at least one of a multiplicity of physical registers based on at least one of recent usage and access frequency of each one of the architectural registers by multiple multithreading (MT) threads; and

iii) looking up a match for each one of the register accessing requests in the architectural registers when the requested register not found in the mapping of the physical registers.

In a first possible implementation form of the method according to the second aspect the method further includes monitoring the at least one of recent usage and access frequency by recording the plurality of register accessing requests which are received via at least one pipeline engine.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

In the drawings:

Fig. 1 is a simplified block diagram of system for handling a register accessing request, according to embodiments of the invention;

Fig. 2 is a simplified illustration of a register mapping scheme, according to embodiments of the invention;

Fig. 3 is a simplified block diagram of a method for handling register accessing requests according to embodiments of the invention;

Fig. 4 is a simplified block diagram of a method for thread context switching according to embodiments of the invention; and

Fig. 5 is a simplified flowchart of a method for handling a register accessing request according to embodiments of the invention. DETAILED DESCRIPTION

The present invention, in some embodiments thereof, relates to multithreading and, more specifically, but not exclusively, to architectural register management in multi-threading cores.

Embodiments of the invention utilize a register mapping scheme that dynamically maps the most recently and/or frequently used architectural registers to a smaller physical register file set, and fetches the registers' content on demand.

In some embodiments, when a new register access request is issued the register mapping (also denoted herein the mapping table) is checked to see if the requested architectural register is mapped to a physical registers. When the requested register is present in the PRF, the physical register is utilized for the register access.

When a requested register is not mapped to the PRF, one or more physical registers are written back to the ARF to make registers in the PRF available for storage of other architectural register values. The requested architectural register is written to a physical register, and access continues from the PRF.

The mapping table is maintained dynamically and updated as needed during assignment and reassignment of physical registers to architectural registers.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.

The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Some embodiments of the invention are based on a computing system that includes:

1) A memory (e.g. an SRAM) that stores architectural states of multiple

MT threads;

2) Physical registers which store a physical register file (denoted herein the PRF); and

3) A register mapping that dynamically maps architectural registers to physical registers.

As used herein the terms "architectural register file" and "ARF" mean the dataset which includes the entire architectural state for all threads. The terms are not limited to a particular type of file, organization of data or memory element used for storing the ARF.

The memory storing the ARF has denser data storage than the physical registers but its access time is longer than the access time of the physical registers. A reasonably sized PRF enables quick access to some architectural register content without requiring a drastic increase in silicon area. As used herein the terms "physical register file" and "PRF" mean the dataset stored in the physical registers. The terms are not limited to a particular type of file or organization of data.

In some embodiments, the memory stores all the architectural states for all threads. This embodiment uses simple logic for fixed indexing but is more costly in area. In other embodiments, the memory only stores architectural states not stored in the PRF, resulting in a reduction in area with increased indexing complexity.

In some embodiments, each active thread has predefined number of physical registers cannot use physical registers allocated to other threads. In other

embodiments, physical registers are dynamically allocated to all active threads.

In some embodiments, when a new register access request is issued its source and destination operands are looked up in the register mapping (also denoted herein the mapping table). When the mapping table shows that the requested register is present in the PRF, the physical register is utilized for the register access. When one or more requested registers are not mapped to the PRF, a replacement cycle occurs. In the replacement cycle one or more physical registers (for example the least recently used registers) are written back to the architectural registers. These physical registers are then available for storage of other architectural register values. After a warm up period, all the selected architectural registers (e.g. recently used) will be cached in the PRF and execution will require relatively few replacement cycles. However when the processing moves to a new phase that employs different architectural registers a new warm up period may occur. The mapping table is maintained dynamically and updated as needed during or after the replacement cycle.

As used herein the terms "register request" and "register access request" include requests for read and write operations to the register.

The register mapping described herein is particularly beneficial for core implementations that employ a large number of threads in order to exploit thread level parallelism (such as graphic accelerators, big data servers, etc.). A single core is able to support an increased number of threads by increasing thread level parallelism (TLP) without having the overhead of duplicating the entire architectural states of all threads or limiting operation to CGMT with long thread switch periods. As the number of threads per core increases, the potential benefit of the register mapping described herein increases. Reference is now made to Fig. 1, which is a simplified block diagram of system for handling a register accessing request, according to embodiments of the invention. System 100 includes interface 110 and processing unit 120.

Interface 110 receives register accessing requests. Optionally, the register accessing requests are submitted by multiple MT threads. Optionally, the register accessing requests are received via at least one pipeline engine.

Processing unit 120 dynamically maps a group of registers from architectural registers 150 to physical registers 140. Optionally the mapping is based on:

i) Access frequency by the MT threads ("frequently-used");

ii) Recent usage by the MT threads ("recently-used"); and/or

iii) A combination of access frequency and recent usage.

In response to a register access request, processing unit 120 determines from the mapping table whether the register value is stored in physical registers 140. When a match is not found in the physical registers 140, processing unit 120 looks up a match for the requested register in architectural registers 150.

Optionally, the architectural registers are stored in a static random access memory (SRAM).

In some embodiments, system 100 includes a memory which stores a recent usage dataset. Processing unit 120 updates the recent usage dataset with recent usage of each register, and performs the mapping, at least in part, according to the recent usage dataset. Additionally or alternately, the memory stores an access frequency dataset. Processing unit 120 updates the recent usage dataset with access frequency of each register, and performs the mapping, at least in part, according to the access frequency dataset.

Optionally, the recent usage dataset comprises multiple records. Each record documents the recent usage of architectural registers by a respective thread.

Optionally, the recent usage dataset includes an allocation state of each architectural register. The allocation state indicates when the architectural register is allocated to a physical register, in which case the architectural register value may be read from or written to the physical register, and optionally indicates the physical register to which the architectural register is allocated.

Optionally, architectural registers 150 are allocated to both suspended (i.e. inactive) and running (i.e. active) threads of the multiple MT threads, and physical registers 140 are allocated to running threads. Optionally, processing unit 120 updates the recent usage dataset when the allocation of an architectural register is switched from one MT thread to another thread. This may occur when a thread is terminated or added.

Optionally, processing unit 120 updates the recent usage dataset when the allocation of a physical register is switched from one MT thread to another thread. This may occur when a thread is inactive and the physical register is reallocated to an architectural register of a different thread.

Optionally, processing unit 120 maps the architectural registers to respective MT threads.

Optionally, processing unit 120 switches the mapping of an architectural register for a given MT thread to different architectural register.

Reference is now made to Fig. 2, which is a simplified illustration of a register mapping scheme, according to embodiments of the invention. In Fig. 2:

i) N denotes a number of active threads;

ii) M denotes a total number of threads (active and inactive);

iii) K denotes a number of all registers per thread; and

iv) J denotes a number of registers per thread which are stored in

PRF 130.

Thus, the total number of registers in the ARF is M*K, whereas the number of registers in the PRF is the smaller number N*J.

For clarity, in the non-limiting embodiment of Fig. 2 the registers stored in the PRF are selected on the basis of access frequency ("frequently-used"). In other embodiments, the registers stored in the PRF are selected by a different criterion (e.g. recently-used) and register mapping, access and handling is performed in a substantially similar manner.

Mapping table 210 specifies whether the requested register is allocated in PRF 220 and also maintains other information used for finding candidates for replacement (e.g. the least frequently used register). In the embodiment of Fig. 2, mapping table 210 holds the following fields for each register of the active thread:

i) Valid - indicates whether the architectural register value is stored in the PRF;

ii) Index- maps the architectural register to a physical register;

iii) Dirty - indicates whether the value stored in the architectural register is corresponds to the value of the mapped physical register; iv) Access frequency - May be used to select a physical register for overwrite when a requested architectural register is not in the PRF.

In Fig. 2, pipeline engine 200 is running N active threads. Active threads issue register access requests for architectural registers. When a register request is received from pipeline engine 200, mapping table 210 is used to determine whether the register value may be accessed from the PRF 220 relatively quickly or must be obtained from ARF 230.

ARF 230 stores the architecture register files of all the active and inactive threads. Data may be transferred between ARF 230 and PRF 220 to keep the architectural and physical register values up to date as required for operation. The mapping table is updated accordingly.

In the case of a "register miss" (i.e. the requested architectural register is not in PRF 220) a "victim" physical register is reallocated for the requested architectural register and the content is of the reallocated register is replaced. In some

embodiments, inactive threads are the preferred providers of victim physical registers.

Optionally, the victim physical register is selected at least in part on data stored in the mapping table (e.g. access frequency and/or recent access).

Optionally, remapping of source and destination registers is done in the pipeline engine.

Reference is now made to Fig. 3, which is a simplified flowchart of a method for handling a register accessing request according to embodiments of the invention. In 310, register accessing requests are received. The register mapping is checked in 320 to determine when the requested register is mapped to a physical register. In 330 a match is looked up in the architectural registers (i.e. ARF) for each requested register which is not mapped to a physical register. Optionally, in 340 the

architectural register value is stored in a physical register.

Optionally, in 350 the requested register is accessed from the PRF.

Register mapping from architectural registers to physical registers (i.e. PRF) is performed dynamically in 360. The mapping may be based on a recent usage of each architectural register by the MT threads and/or on recent usage of each architectural register by the MT threads. Optionally, the mapping is performed based on an alternate or additional mapping criterion. Optionally, recent usage of register usage (physical and/or architectural) is monitored by recording register accessing requests which are received via at least one pipeline engine.

Reference is now made to Fig. 4, which is a simplified block diagram of a method for handling register accessing requests according to embodiments of the invention.

In 400, a register accessing request is issued by a pipeline engine. In 410 the mapping table is checked to determine whether the requested register is stored in the PRF (e.g. by checking the "valid" bit of the requested register).

When the requested register is stored in the PRF, register read or write access is performed in 420. For a write operation the write data is stored in the physical register mapped to the requested architectural register. For a read operation, the value stored in the physical register mapped to the requested architectural register is returned.

When the requested register is not stored in the PRF, the PRF is searched in

430 to find an available physical register to store the requested architectural register's data. When an available physical register is not found, in 450 a victim physical register is selected and its content stored back to the ARF, thereby creating an available physical register.

In 460 it is determined whether the access is a read request or not. When the access is a read request, then in 470 the requested register value is copied from the ARF to an available register in the PRF as shown in 470. The read or write operation is then performed in 420, as described above.

Reference is now made to Fig. 5, which is a simplified block diagram of a method for thread context switching according to embodiments of the invention.

In 500, it is determined whether the thread switch is hardware or software. When the thread switch is a hardware switch, in 510 the engine pipeline switches the active thread to a different thread, temporarily blocking the previously active thread. In 520, the valid bits for the now active thread are set to zero. In 530, only registers marked as dirty in the mapping table are updated in the ARF.

When the thread switch is a software switch, in 540 the software switches the active thread to another thread, inactivating the previously active thread. In 550, all physical registers mapped to architectural registers for the currently active thread are read and written to memory (i.e. updated in the ARF). In 560, all valid bits in the mapping table are set to zero for the currently active thread. In 570 the previously active thread is deleted from the thread control register which specifies active threads.

In summary, the embodiments presented above are useful for all MT implementations, including CGMT. The register mapping described herein

significantly reduces CGMT overheads since ARFs are not duplicated per thread and recovery of architectural registers is done on demand. Retrieval of registers for a new thread may be performed during the thread switch time (i.e. while the machine front- end is fetching instructions from the new thread). Suspended threads naturally provide physical register victim candidates. Avoiding full ARF duplication results in significant reduction in area (die size) and in energy consumption. The thread switch time is significantly shortened relative to full ARFs save and restore and may be performed primarily in the background of execution.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

It is expected that during the life of a patent maturing from this application many relevant multithreading implementations, register files, architectural registers, physical registers, register mapping implementations and register access operations will be developed and the scope of the terms multithreading, register file, architectural register, physical register, register mapping, register access and register access request is intended to include all such new technologies a priori.

The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of and "consisting essentially of.

The phrase "consisting essentially of means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method. As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.

The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.

The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.

Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicate number and a second indicate number and "ranging/ranges from" a first indicate number "to" a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

1. A system for handling a register accessing request, comprising:

an interface adapted to receive a plurality of register accessing requests;

a processing unit, connected to said interface and adapted to:

map dynamically a group of registers from a plurality of architectural registers to at least one of a plurality of physical registers based on at least one of recent usage and access frequency of each one of said plurality of architectural registers by a plurality of multithreading (MT) threads; and

lookup a match for each one of said register accessing requests in said plurality of architectural registers when said match is not found in said plurality of physical registers.

2. The system of claim 1, wherein said plurality of MT threads submit said plurality of register accessing requests and are of a multi-threading processor.

3. The system of any of the previous claims, wherein said plurality of register accessing requests are received via at least one pipeline engine.

4. The system of any of the previous claims, wherein said plurality of architectural registers are stored in a static random access memory (SRAM).

5. The system of any of the previous claims, further comprising a memory adapted to store an access frequency dataset; wherein said processing unit is adapted to update said access frequency dataset with a frequency of access to respective registers and to perform said mapping according to said access frequency dataset.

6. The system of any of the previous claims, further comprising a memory adapted to store a recent usage dataset; wherein said processing unit is adapted to update said recent usage dataset with said recent usage and to perform said mapping according to said recent usage dataset.

7. The system of claim 6, wherein said recent usage dataset comprises a plurality of records, each documenting a recent usage of each one of said plurality of MT threads to said plurality of architectural registers.

8. The system of any of claims 6-7, wherein said recent usage dataset comprises respective allocation states of said plurality of architectural registers.

9. The system of any of claims 6-8, wherein said plurality of architectural registers maps to an allocation of suspended and running threads of said plurality of MT threads and said plurality of physical registers maps to an allocation of running threads of said plurality of MT threads.

10. The system of any of claims 6-9, wherein said processing unit is adapted to update said recent usage dataset when switching an allocation of any of said plurality of physical registers from one of said plurality of MT threads to another of said plurality of MT threads.

11. The system of any of the previous claims, wherein said processing unit is adapted to map said plurality of architectural registers to said plurality of MT threads.

12. The system of any of the previous claims, wherein said processing unit is adapted to switch mapping of any of said plurality of architectural registers from one of said plurality of MT threads to another of said plurality of architectural registers.

13. The system of any of the previous claims, wherein said processing unit is adapted to set a respective state of physical registers mapped to an active thread to available when said active thread is inactivated by a switch to a different thread.

14. A method for handling a register accessing request, comprising:

receiving a plurality of register accessing requests;

mapping dynamically a group of registers from a plurality of architectural registers to at least one of a plurality of physical registers based on at least one of recent usage and access frequency of each one of said plurality of architectural registers by a plurality of multithreading (MT) threads; and looking up a match for each one of said register accessing requests in said plurality of architectural registers when said requested register not found in said mapping of said plurality of physical registers.

15. The method of claim 14, further comprising: monitoring said at least one of recent usage and access frequency by recording said plurality of register accessing requests which are received via at least one pipeline engine.