
WO2004092913A2 - Digital processor apparatus with code compression and method

Info

Publication number: WO2004092913A2 (application PCT/US2004/011562)
Authority: WO (WIPO PCT)
Prior art keywords: instruction, compressed, processor, compression, instructions
Legal status: Ceased
Other languages: French (fr)
Other versions: WO2004092913A3 (en)
Inventors: Elena G. Nikolova, David J. Mulvaney, Vassilios Chouliaras, Jose L. Nunez-Yanez
Assignee: ARC International; ARC International UK Ltd
Application filed by ARC International and ARC International UK Ltd
Priority date: April 14, 2003 (U.S. Provisional Application Serial No. 60/463,185)

Classifications

    • G06F9/30178: Runtime instruction translation, e.g. macros, of compressed or encrypted instructions
    • G06F9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001: Arithmetic instructions
    • G06F9/30149: Instruction analysis, e.g. decoding, instruction word fields, of variable length instructions
    • G06F9/322: Address formation of the next instruction for a non-sequential address
    • G06F9/324: Address formation of the next instruction using program counter relative addressing
    • G06F9/3802: Instruction prefetching



Abstract

A digital processor architecture and associated techniques for code compression, particularly useful for, e.g., SoC or embedded applications where one or more constraints (such as memory size) are imposed. In one exemplary embodiment, the boundary between compressed and uncompressed code space is disposed between the instruction cache (408) and the processor core; the latter is left completely unaware of compression and its functionality is fully preserved. Mechanisms are described that resolve the issues arising from instruction cache data misalignment and the compressed-to-uncompressed address mapping. The compression techniques of the present invention reduce memory requirements for the IC design, which translates into savings in silicon cost. Alternatively, larger programs with additional functionality can be developed at the same cost compared to a corresponding system without the invention. Reduced power consumption is also provided by the invention.

Description

DIGITAL PROCESSOR APPARATUS WITH CODE COMPRESSION AND METHOD
Priority

This application claims priority to U.S. Provisional Patent Application Serial No. 60/463,185 filed April 14, 2003 of the same title, which is incorporated herein by reference in its entirety.

Copyright

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates to the field of integrated circuits and associated code, specifically to the use of code compression techniques within a pipelined central processing unit (CPU) or embedded processor.
2. Description of Related Technology
RISC (or reduced instruction set computer) processors are well known in the computing arts. RISC processors generally have the fundamental characteristic of utilizing a substantially reduced instruction set as compared to non-RISC (commonly known as "CISC") processors. Typically, RISC processor machine instructions are not all micro-coded, but rather may be executed immediately without decoding, thereby affording significant economies in terms of processing speed. This "streamlined" instruction handling capability furthermore allows greater simplicity in the design of the processor (as compared to non-RISC devices), thereby allowing smaller silicon and reduced cost of fabrication. RISC systems are often used in embedded applications for these (and other) reasons.
The design of embedded systems is subject to many constraints: development and production costs, power consumption, reliability, safety and so on. On average, code size doubles every two years and often the memory in embedded systems is the most expensive piece of circuitry. Additional pressure on program memory comes from the relatively recent adoption in embedded systems of high-level languages like C and C++, which allow programs to be designed at a higher level of abstraction. For control-oriented embedded applications, a significant portion of the circuitry is used for instruction memory. High performance systems are also impacted by increasing program size due to the delays incurred by instruction cache misses.
Thus, code size has become a major problem for RISC embedded systems design, and a range of different methods has been proposed for solving it. The best-known examples include tailored compiler optimizations, which achieve about 5% code reduction compared to the original code size; use of a subset of the original instruction set (ARM Thumb, MIPS16, ARCompact), which has produced up to 30% reduction; and code factoring (modification of the code by replacing repeated blocks with function calls), which also produces roughly a 30% reduction.
So-called "lossless" compression provides a further approach for factoring out the redundancy in the embedded code. It is an appealing approach as the code reduction exceeds that of other methods, reaching up to 50% reduction in code size. The algorithms used in code compression are based on those developed for text compression, but the algorithms need to be adapted to the constraints encountered in the embedded program's structure. When executing code, the execution flow is not continuous due to the change of flow (COF) instructions, namely jumps and branches; yet the decompression should be able to start at any point. However, this is not normally available in algorithms used for text compression. The other major problem is that translation is required of uncompressed memory space addresses to the actual target addresses in compressed memory, that is if the processor is to be unaware of compression. The hardware that performs the decompression should be fast and simple, in order not to degrade the original system's performance. In 1992, Wolfe and Chanin proposed a Code Compression RISC Processor (CCRP). The
CCRP consisted of a standard RISC processor core augmented with a special code-expanding instruction cache, and the decompression engine is situated between the main memory and the cache. The code, generated by a standard compiler, is compressed on the host development system and then stored into the embedded system instruction memory. For the symbol of compression, fixed 32-byte blocks were used. The study used bounded (up to 16 bits) static Huffman encoding algorithm. While encoding, it is possible that some blocks grow larger rather than smaller, and in such cases, the block is bypassed and left uncompressed. At run time, on a cache miss, the instruction cache (Icache) refill engine locates the compressed cache lines in the memory and expands them; instructions in the instruction cache are exactly the same as in the uncompressed program. The approach requires a new design for the ICache refill engine (which plays the role of decompressor), but no modifications to the core processor are needed. In CCRP, program target addresses have different values if the line is in the main memory or in the cache and a line address table (LAT) is used to map addresses in the cache to the compressed code addresses in the main memory. The LAT contains one 64-bit wide entry for every 64 consecutive instructions, using pointers to locate blocks within that space. A compression ratio of 0.73 (the ratio between the size of the compressed program and the original one) was achieved for the MIPS R2000.
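To make the LAT mechanism concrete, the following is a minimal C sketch of CCRP-style address translation, assuming a simplified entry layout (a base pointer per group of eight 32-byte blocks plus per-block compressed lengths); the actual LAT packs its pointers differently, so the field widths here are illustrative only.

```c
/* Hedged sketch of CCRP-style line address translation (LAT).
 * Assumed layout: one entry covers a group of 8 compressed 32-byte blocks,
 * holding a base address plus the compressed length of the first 7 blocks.
 * Field widths and grouping are illustrative, not taken from the patent. */
#include <stdint.h>

typedef struct {
    uint32_t base;    /* compressed address of the group's first block */
    uint8_t  len[7];  /* compressed lengths of blocks 0..6, in bytes   */
} lat_entry_t;

/* Translate an uncompressed instruction address to the compressed address
 * of the 32-byte block that contains it. */
uint32_t lat_translate(const lat_entry_t *lat, uint32_t uncompressed_addr)
{
    uint32_t block = uncompressed_addr / 32;  /* 32-byte block index */
    uint32_t group = block / 8;               /* 8 blocks per entry  */
    uint32_t index = block % 8;

    uint32_t addr = lat[group].base;
    for (uint32_t i = 0; i < index; i++)
        addr += lat[group].len[i];            /* skip earlier blocks */
    return addr;
}
```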
IBM's CodePack implements a similar approach. PowerPC code is compressed in 64-byte blocks using a Huffman encoding scheme, and compressed blocks are expanded when a 64-byte cache line is loaded. Two tables, each of length 2 Kbytes, are used for holding token expansions; the first table is used for the high 16 bits of each instruction and the second for the low 16 bits. Whenever a branch is taken, a hash table is searched to find the actual address to use. To implement the compression scheme, the tool chain did not need modification, but a postprocessor was developed to compress the ROM image of the program and build the necessary tables. The compression ratio claimed was in the range 0.60 to 0.65.
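Purely as an illustration of the two-table idea, the sketch below reassembles an instruction from its two expanded halves. Real CodePack decodes variable-length Huffman tokens from a bitstream; this sketch substitutes fixed-width indices, and the table size is an assumption.

```c
/* Hedged sketch of CodePack-style expansion: each 32-bit instruction is
 * rebuilt from two independently coded halves, the upper 16 bits from one
 * table and the lower 16 bits from another. Fixed-width indices replace
 * the real variable-length Huffman codes purely to keep the sketch short. */
#include <stdint.h>

#define TOKENS 1024  /* assumed table size, not taken from the text */

uint32_t codepack_expand(const uint16_t hi_table[TOKENS],
                         const uint16_t lo_table[TOKENS],
                         uint16_t hi_code, uint16_t lo_code)
{
    uint16_t hi = hi_table[hi_code % TOKENS];  /* expand upper half */
    uint16_t lo = lo_table[lo_code % TOKENS];  /* expand lower half */
    return ((uint32_t)hi << 16) | lo;          /* reassembled insn  */
}
```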
Another, more innovative, hardware implementation of a decompressor is found in the Motorola MPC56x microcontrollers; the quoted compression ratio achieved is about 0.7. The scheme is optimized for cacheless systems, but it is also effective in systems operating in a low cache-hit-ratio environment and for systems with fast embedded program memory. The decompression module is incorporated into the processor's Burst Buffer Controller (BBC), which can work in both compressed and non-compressed modes. The compression algorithm used is class-based Huffman encoding, and the symbol of compression is either a word or a half-word. The two halves of an instruction can be compressed separately and placed into different regions in memory, which is separated into two parts by a dangling boundary. The dangling boundary is necessary as the two instruction halves are restricted to be within 32 consecutive bits (even though they can be placed in different memory words). To allow the dangling boundary partition, bit 31 in all memory words is allocated for boundary control, which gives a compression ratio penalty of 0.03. The decompression unit requires 2 Kbytes of RAM for decompressor dictionary table storage, and a special branch target buffer to supply the processor with previously accessed or decompressed instructions. The system contains several sets of dictionaries for the different classes of instruction. The information regarding the different class formats is held in the class registers. The MPC56x is a complex solution for developing a compression model implemented within a non-cached RISC processor, yet it is able to maintain the required RISC performance.
See also the code compression techniques described in co-owned and co-pending U.S. Patent Application Serial No. 09/808,469 filed March 14, 2001 and entitled "Method and Apparatus for Processor Code Optimization Using Code Compression" which is incorporated herein by reference in its entirety. While providing good performance, this solution may not be optimal for all applications.
United States Patent No. 5,045,852 to Mitchell, et al. issued September 3, 1991 and entitled "Dynamic model selection during data compression" discloses a system and method for maximizing data compression by optimizing model selection during coding of an input stream of data symbols, wherein at least two models are run and compared, and the model with the best coding performance for a given-size segment or block of compressed data is selected such that only its block is used in an output data stream. The best performance is determined by 1) respectively producing comparable-size blocks of compressed data from the input stream with the use of the two, or more, models and 2) selecting the model which compresses the most input data. In the preferred embodiment, respective strings of data are produced with each model from the symbol data and are coded with an adaptive arithmetic coder into the compressed data. Each block of compressed data is started by coding the decision to use the model currently being run and all models start with the arithmetic coder parameters established at the end of the preceding block. Only the compressed code stream of the best model is used in the output and that code stream has in it the overhead for selection of that model. Since the decision as to which model to run is made in the compressed data domain, i.e., the best model is chosen on the basis of which model coded the most input symbols for a given-size compressed block, rather than after coding a given number of input symbols, the model selection decision overhead scales with the compressed data. Successively selected compressed blocks are combined as an output code stream to produce an optimum output of compressed data, from input symbols, for storage or transmission.
United States Patent No. 5,463,746 to Brodnax, et al. issued October 31, 1995 and entitled "Data processing system having prediction by using an embedded guess bit of remapped and compressed opcodes" discloses a data processing system that includes branch prediction apparatus for storing branch data in a branch prediction RAM after each branch has occurred. The RAM interfaces with branch logic means which tracks whether a branch is in progress and whether a branch was guessed. An operational code compression means forms each instruction into a new operation code of fewer bits and embeds a guess bit into the new operational code. Control means decode the compressed operational code as an input to an instruction execution unit, whereby a conditional branch occurs based on the guess bit provided a branch instruction is not in progress in the system.
United States Patent No. 5,703,907 to James issued December 30, 1997 and entitled "Method for data compression" discloses methods for compressing data including methods for compressing highly randomized data. Nibble encode, distribution encode, and direct bit encode methods are disclosed for compressing data which is not highly randomized. A randomized data compression routine is also disclosed for compressing data which is highly randomized. The compression methods disclosed operate on a bit level and accordingly are ostensibly insensitive to the nature or origination of the data sought to be compressed.
United States Patent No. 5,729,228 to Franaszek, et al. issued March 17, 1998 and entitled "Parallel compression and decompression using a cooperative dictionary" discloses a method and apparatus for compressing a block of data using a shared dictionary. Data to be compressed is divided into subblocks which are each provided to a respective compressor in a plurality of compressors. The compressors cooperatively construct a dynamic compression dictionary and compress the subblocks in parallel using the dictionary. Compressed subblocks output by the compressors are concatenated to form a compressed block.
United States Patent No. 5,764,994 to Craft issued June 9, 1998 and entitled "Method and system for compressing compiled microcode to be executed within a data processing system" discloses a method for compressing a set of compiled microcode to be utilized within a data processing system. All branch instructions within a set of compiled microcode are first identified. Then, the set of compiled microcode is parsed into a number of microcode segments such that each microcode segment begins at an instruction following each identified branch instruction or at a target address of each identified branch instruction. Subsequently, each of these microcode segments is individually translated to its compressed form by utilizing a data-compression routine. Finally, all the compressed microcode segments are concatenated together and linked by inserting branch instructions with modified target addresses, to yield a set of compressed executable microcode. By doing so, the memory required for storing the compressed executable microcode is reduced.
United States Patent No. 5,878,267 to Hampapuram, et al. issued March 2, 1999 and entitled "Compressed instruction format for use in a VLIW processor and processor for processing such instructions" discloses software which creates a compressed instruction format for a VLIW processor that allows greater efficiency in the use of cache and memory. Instructions are byte-aligned and variable length. Branch targets are uncompressed. Format bits specify how many issue slots are used in a following instruction. NOPs are not stored in memory. Individual operations are compressed according to features such as whether they are resultless, guarded, short, zeroary, unary, or binary. Instructions are stored in compressed form in memory and in cache, and are decompressed on the fly after being read out from cache.
United States Patent No. 5,943,421 to Grabon issued August 24, 1999 and entitled "Processor having compression and encryption circuitry" discloses a data processing system that includes a data processor or CPU having decompression circuitry and decryption circuitry that operates on compressed/encrypted data to produce decompressed and decrypted data. The data processing system includes memory in which instructions and data are stored in a compressed and/or encrypted format. The CPU retrieves the compressed/encrypted data over a system bus. A bus interface unit within the CPU receives the compressed/encrypted data, decompresses and decrypts the data and stores the data in cache memory. An execution unit and other components within the CPU retrieve the decompressed and decrypted data and operate upon it. Alternatively, upon retrieval of compressed/encrypted data from memory the data is stored in cache memory in its compressed/encrypted format. Upon retrieval by the execution unit, the data is decompressed and decrypted in preparation for execution by the execution unit. A data processing system of the present invention requires that the CPU decrypt encrypted data. Thus, any devices accessing data in the system over a network without decryption by the CPU retrieve encrypted data that cannot otherwise be decrypted.
United States Patent No. 5,945,933 to Kalkstein issued August 31, 1999 and entitled "Adaptive packet compression apparatus and method" discloses a method for compressing each of a plurality of data packets to form a compressed packet for transmission by a communication device, the data packets being composed of a sequence of data elements and the data packets being stored on a first computer such that the method is performed by the first computer. The method comprises (a) receiving one of the plurality of data packets designated as packet Pm; (b) parsing the packet such that the sequence of data elements of the packet is parsed into a sequence of parsed elements, each of the parsed elements having a form comprising a character, a pair of offset and length components, or a run length encoding consisting of a repetition factor component and a character component, and each of parsed elements and each of the components of the parsed elements having a frequency of occurrence; (c) selecting an encoding table from a historical array, the historical array including at least one encoding table from compression of at least one previously compressed data packet, the encoding table having been constructed according to the frequencies of occurrence of a plurality of parsed elements of the at least one previously compressed data packet, independent of data from the packet.
United States Patent No. 5,946,489 to Yellin, et al. issued August 31, 1999 and entitled "Apparatus and method for cross-compiling source code" discloses a method of cross-compiling computer programs that includes the step of extracting constants from an inheriting computer program written in a first computer language. The extracted constants refer to a generating computer program written in a second computer language. A new program in the second computer language is then created using the constants. The new program is then compiled for a target computer to ascertain compiled constant values. The compiled constant values are then substituted into the inheriting computer program to produce a final inheriting computer program.
United States Patent No. 6,005,503 to Burrows issued December 21, 1999 and entitled "Method for encoding and decoding a list of variable size integers to reduce branch mispredicts" discloses a list of variable size integers encoded in a memory. Each variable size integer is expressed as a set of a minimum number of bytes. A fixed number of the bytes of the sets are grouped with an associated bit map into a logical memory word unit. Each bit map has one continuation bit for each of the fixed number of bytes, each continuation bit indicating whether or not a particular variable size integer continues into a following byte. An entry is stored in an array for each possible pattern of continuation bits of the bit maps. Each entry includes a plurality of fields, one field for each of the fixed number of bytes in each group. Each field stores the length of a corresponding set of bytes expressing a particular variable size integer in the group. The entries provide a decoding table that is indexable by the bit maps to recover the list of variable size integers.

United States Patent No. 6,121,903 to Kalkstein issued September 19, 2000 and entitled "On-the-fly data re-compression" discloses recompression of previously compressed data with a better algorithm. A data analyzer analyses the compressed data to determine whether the data has been compressed with a supported compression format such as GIF, JPG or PNG. A data decompressor decompresses the data, then a data compressor re-compresses the data using the superior compression algorithm. When the data is to be reused, a data decompressor decompresses the data and a data compressor recompresses the data using the original compression algorithm. The invention may be implemented over a path in a data network. Compressed data that is to be sent over the path is intercepted and re-compressed with the superior algorithm before being sent over the path. On the other end of the path, the data is converted back to its original compression format.
United States Patent No. 6,125,201 to Zador issued September 26, 2000 and entitled "Method, apparatus and system for compressing data" discloses an apparatus and method for image data compression that performs a modified zero-tree coding on a range of absolute image values from the largest to a determined smaller absolute value, based upon file size or quality. If it is desired to maintain more detail in the image, then a vector quantizer codes the remaining values below this determined smaller value to zero, and lossless entropy coding is performed on the results of the two coding steps. The determined smaller value can be adjusted by examination of the histogram of the tree, or iteratively to meet a preselected compressed image size criterion or to meet a predefined level of image quality, as determined by any suitable metric. If the image to be compressed is in RGB color space, the apparatus converts the RGB image to a less redundant color space before commencing further processing.
United States Patent No. 6,195,743 to Elnozahy issued February 27, 2001 and entitled "Method and system for compressing reduced instruction set computer (RISC) executable code through instruction set expansion" discloses a compression scheme for program executables that run on Reduced Instruction Set Computer (RISC) processors, such as the PowerPC architecture. The RISC instruction set is expanded by adding opcodes to produce code that facilitates the removal of redundant fields. To compress a program, a compressor engine rewrites the executable using the new expanded instruction set. Next, a filter is applied to remove the redundant fields from the expanded instructions. A conventional compression technique such as Huffman encoding is then applied on the resulting code.

United States Patent No. 6,233,674 to Elnozahy issued May 15, 2001 and entitled "Method and system for scope-based compression of register and literal encoding in a reduced instruction set computer (RISC)" discloses a compression scheme for program executables that run in a reduced instruction set computer (RISC) architecture such as the PowerPC. The method and system utilize scope-based compression for increasing the effectiveness of conventional compression with respect to register and literal encoding. First, discernible patterns are determined by exploiting instruction semantics and conventions that compilers adopt in register and literal usage. Additional conventions may also be set for register usage to facilitate compression. Using this information, separate scopes are created such that in each scope there is a more prevalent usage of a limited set of registers or literal value ranges, or there is an easily discernible pattern of register or literal usage. Each scope then is compressed separately by a conventional compressor. The resulting code is more compact because the small number of registers and literals in each scope makes the encoding sparser than when the compressor operates on the global scope that includes all instructions in a program. Additionally, scope-based compression reveals more frequent patterns within each scope than when considering the entire instruction stream as an opaque stream of bits.
United States Patent No. 6,442,680 to Elnozahy issued August 27, 2002 and entitled "Method and system for compressing reduced instruction set computer (RISC) executable code" discloses a method and system for a compression scheme used with program executables that run in a reduced instruction set computer (RISC) architecture such as the PowerPC. Initially, a RISC instruction set is expanded to produce code that facilitates the removal of redundant fields. The program is then rewritten using this new expanded instruction set. Next, a filter is applied to remove redundant fields from the expanded instructions. The expanded instructions are then clustered into groups, such that instructions belonging to the same cluster show similar bit patterns. Within each cluster, the scopes are created such that register usage patterns within each scope are similar. Within each cluster, more scopes are created such that literals within each instruction scope are drawn from the same range of integers. A conventional compression technique such as Huffman encoding is then applied on each instruction scope within each cluster. Dynamic programming techniques are then used to produce the best combination of encoding among all scopes within all the different clusters. Where applicable, instruction scopes are combined that use the same encoding scheme to reduce the size of the resulting dictionary. Similarly instruction clusters are combined that use the same encoding scheme to reduce the size of the resulting dictionary.
United States Patent Application Publication No. 20030225998 to Khan, et al. published December 4, 2003 and entitled "Configurable data processor with multi-length instruction set architecture" discloses digital processor apparatus having an instruction set architecture (ISA) with instruction words of varying length. In the exemplary embodiment, the processor comprises an extended user-configurable RISC processor with a four-stage pipeline (fetch, decode, execute, and writeback) and associated logic that is adapted to decode and process both 32-bit and 16-bit instruction words present in a single program, thereby increasing the flexibility of the instruction set and allowing for greater code compression and reduced memory overhead. Free-form use of the different length instructions is provided with no required mode shift. An improved instruction aligner and code compression architecture is also disclosed.
Despite the foregoing variety of different code compression schemes and apparatus, there is still a need to provide an improved processor code compression approach and apparatus which can be used in, and is optimized for, inter alia, embedded RISC applications. Such an improved compression approach would ideally support mixed-form instruction sets, including the addition of user-defined or user-added extension instructions and associated hardware.
Summary of the Invention

The foregoing needs are satisfied by the present invention, which provides improved methods and apparatus for processor code compression, including those particularly adapted for an embedded RISC processor having an extensible architecture. In a first aspect of the invention, an improved pipelined data processor apparatus having an instruction cache and processor core is disclosed, the instruction cache containing at least one compressed instruction. In one exemplary embodiment, the processor comprises a user-configurable RISC processor apparatus having an instruction cache, a processor core with multi-stage pipeline, and an instruction set comprising at least one user-defined extension instruction, wherein the instruction cache contains at least one compressed instruction, and the code space compression boundary is disposed between the cache and the core. The multi-stage pipeline comprises at least pack and decompression decode stages disposed between the instruction fetch and instruction decode stages of the pipeline. A class-based dictionary (CBD) scheme is employed.
In a second aspect of the invention, an improved method of processing compressed processor instructions is disclosed. The method generally comprises using a compression scheme to dispose at least one compressed instruction within the instruction cache of the processor. The compressed instruction is then decompressed (and at least partly decoded) prior to traditional instruction decode by the decode stage of the pipeline.
In a third aspect of the invention, an improved method of compressing processor programs and instructions is disclosed. In one exemplary embodiment, when a program is being compressed, the new address of each instruction, including its bit offset in the memory word, is calculated. The branch offsets are not changed. Instead, an exemplary 16-bit entry is appended to each branch instruction. The first 11 bits of the entry represent the offset difference between the compressed and uncompressed space, while the next 5 bits show the offset of the compressed word in the memory word; i.e., at which bit position the target instruction word begins. Adding this entry after each branch instruction only minimally degrades the compression ratio, such minimal degradation being more than offset by the other features of the invention which increase the compression ratio.
In a fourth aspect of the invention, an improved method of synthesizing a processor design having the improved code compression functionality described above is disclosed. In one exemplary embodiment, the method comprises generating a processor configuration specification based on user inputs as to the design of a configurable processor, such inputs including selections relating to the implementation of the code compression methods and functional units disclosed herein. The configuration specification is then reduced to a hardware language representation of the processor, which may then be simulated, synthesized, etc.
In a fifth aspect of the invention, an improved method of decoding instructions within a digital processor is disclosed. In one exemplary embodiment, the method comprises: fetching at least one instruction; performing at least a partial decode of the at least one instruction; determining, based at least in part on the decode, whether the at least one instruction belongs to a first class (such as, e.g., a branch instruction); and if the at least one instruction does belong to the first class, sending at least offset information disposed within the at least one instruction to a program counter. If the instruction does not belong to the first class, then it is evaluated as to whether it belongs to a second class (e.g., jumps), in which case at least one operand is decoded in order to determine the type of instruction within the second class.
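As a rough illustration of this fifth aspect, the C sketch below performs the partial decode and routes branch offsets to the program-counter logic; the opcode values, field widths, and callback interfaces are invented for illustration and are not the ARC encodings.

```c
/* Hedged sketch of the class-based partial decode described above.
 * Opcode values and the 20-bit offset field are hypothetical. */
#include <stdint.h>

enum insn_class { CLASS_BRANCH, CLASS_JUMP, CLASS_OTHER };

static enum insn_class classify(uint32_t insn)
{
    uint32_t opcode = insn >> 27;              /* assumed 5-bit major opcode */
    if (opcode == 0x01) return CLASS_BRANCH;   /* hypothetical encoding      */
    if (opcode == 0x02) return CLASS_JUMP;
    return CLASS_OTHER;
}

void partial_decode(uint32_t insn,
                    void (*send_offset_to_pc)(int32_t),
                    void (*decode_jump_operand)(uint32_t))
{
    switch (classify(insn)) {
    case CLASS_BRANCH:
        /* first class: forward the embedded offset to the program counter */
        send_offset_to_pc((int32_t)(insn & 0xFFFFF));
        break;
    case CLASS_JUMP:
        /* second class: the operand determines the jump subtype */
        decode_jump_operand(insn);
        break;
    default:
        break;  /* ordinary instruction: proceed to the full decode stage */
    }
}
```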
In a sixth aspect of the invention, an improved method of optimizing the performance of a processor design through use of an instruction code compression scheme having a dictionary is disclosed. In one exemplary embodiment, the method comprises: selecting a plurality of parameters forming the basis of optimization; determining a plurality of different program types formed from instructions within the instruction code; running optimizations, based at least in part on the plurality of parameters, for each of the program types; determining, based on the optimizations, at least a substantially optimal size for the dictionary for each of the program types; and selecting, based on a given program type to be used in the processor, its substantially optimal dictionary size to be used in the design thereof.
Brief Description of the Drawings
Fig. 1 is a functional block diagram of an exemplary RISC processor core configuration (ARC 700) useful with the methods and apparatus of the present invention.
Fig. 2 is a graphical representation of an exemplary 16-bit entry format used with the code compression scheme of the present invention.

Fig. 3 is a graphical representation showing the compression performance associated with the techniques of the present invention as compared to various other prior art techniques, based on a variety of software test bench tools.
Fig. 4 is a schematic block diagram illustrating an exemplary processor code compression architecture including "compressed" instruction cache and additional compression pipeline stages.
Fig. 4a is a logical flow diagram of an exemplary method of decoding processor instructions according to the present invention.
Fig. 4b is a logical flow diagram of an exemplary method of optimizing the compression ratio and/or other parameters based on program type according to the invention.

Fig. 4c is a logical flow diagram of an exemplary method of decoding processor branch and jump instructions to determine compressed memory offset according to the present invention.

Fig. 5 is a schematic block diagram illustrating an exemplary embodiment of the compressed program counter (CPC) unit of the present invention.
Fig. 6 is a logical flow diagram of an exemplary processor design process including the addition of code compression functionality according to the present invention.
Detailed Description
Reference is now made to the drawings wherein like numerals refer to like parts throughout.
As used herein, the term "processor" is meant to include any integrated circuit or other electronic device (or collection of devices) capable of performing an operation on at least one instruction word including, without limitation, reduced instruction set core (RISC) processors such as for example the ARC 700, 600, ARCompact™ A5 and ARCtangent™ A4 user-configurable cores manufactured by the Assignee hereof (described in Appendix I hereto), central processing units (CPUs), and digital signal processors (DSPs). The hardware of such devices may be integrated onto a single substrate (e.g., silicon "die"), or distributed among two or more substrates. Furthermore, various functional aspects of the processor may be implemented solely as software or firmware associated with the processor. Additionally, it will be recognized by those of ordinary skill in the art that the term
"stage" as used herein refers to various successive stages within a pipelined processor; i.e., stage 1 refers to the first pipelined stage, stage 2 to the second pipelined stage, and so forth. Such stages may comprise, for example, instruction fetch, decode, execution, and writeback stages. Lastly, any references to hardware description language (HDL) or VHSIC HDL
(VHDL) contained herein are also meant to include other hardware description languages such as Verilog®. Furthermore, an exemplary Synopsys® synthesis engine such as the Design Compiler 2000.05 (DC00) may be used to synthesize the various embodiments set forth herein, or alternatively other synthesis engines such as Buildgates® available from, inter alia, Cadence Design Systems, Inc., may be used. IEEE Std. 1076.3-1997, IEEE Standard VHDL Synthesis Packages, describes an industry-accepted language for specifying a Hardware Definition Language-based design and the synthesis capabilities that may be expected to be available to one of ordinary skill in the art.

Overview
The improved code compression scheme proposed herein generally implements a simple class-based dictionary (CBD) technique. In the exemplary embodiments, the most frequently encountered instructions are coded in 8-bit or 16-bit codewords and placed in a dictionary, which is stored in the decompressor. The decompression core is placed between the processor and the instruction cache (Icache). As the cache holds compressed instructions, the cache miss ratio decreases, which compensates for the performance loss introduced in the system by the online memory decompression. Due to the compression, the memory requirements for the design decrease, which translates into savings in silicon cost. Alternatively, larger programs with additional functionality can be developed at the same cost compared to a corresponding system without compression. Another benefit is a reduction in power consumption, as one of the most power-hungry operations in an embedded system is the transfer of data between the different levels of the memory hierarchy.
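The dictionary look-up at the heart of the CBD scheme can be sketched as follows; the dictionary size and the flat linear search are simplifying assumptions (the embodiments described below split the dictionary into classes reached through short codewords).

```c
/* Hedged sketch of the CBD idea: the most frequent 32-bit instructions
 * live in a dictionary, and each occurrence in the program is replaced by
 * a short codeword (its index). DICT_SIZE and the flat search are
 * simplifications; the described scheme partitions the dictionary. */
#include <stdint.h>

#define DICT_SIZE 256  /* assumed: entries reachable by a short codeword */

/* Return the dictionary index of insn, or -1 if it must stay uncompressed
 * (emitted with a one-bit "not compressed" flag, as described later). */
int cbd_lookup(const uint32_t dict[DICT_SIZE], uint32_t insn)
{
    for (int i = 0; i < DICT_SIZE; i++)
        if (dict[i] == insn)
            return i;   /* emit short codeword = index i */
    return -1;          /* emit the full instruction plus flag bit */
}
```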
ARC™ Configurable Processors and Instruction Set Architectures
The exemplary embodiments described herein are cast in terms of the user-configurable processor cores manufactured by the Assignee hereof (including the ARC 700, ARC 600, and ARCtangent A5 and A4 cores); however, it will be recognized that the methods and apparatus of the invention can be practiced with and adapted to a broad variety of processor types and architectures. The invention should therefore not be considered to be limited in any way to these exemplary embodiments.
The ARC 700 processor core 100 (Fig. 1) is a RISC core having a 7-stage scalar fully interlocked instruction pipeline; dynamic branch prediction; support for non-blocking loads; hit-under-miss data cache; 32-bit data, instruction and address buses; a 64-bit memory system; AMBA and BVCI protocol based interfaces; configurable instruction cache and writeback data cache (configurable RAM sizes); optional on-chip Code RAM and Data RAM; additional instructions for multiprocessing support; a 32-bit single cycle barrel shifter/rotate device; and a fast 32x32 multiplier. The designer is also provided the choice of a Harvard or Von Neumann bus architecture.
The ARC 700 core 100 has 35 registers in the base processor (extendible to 63), 26 general purpose registers (extendible to 54), and dedicated registers for loop count, program counter, branch link, global pointer, and stack pointers. It further includes optional 2^32 x 32-bit auxiliary registers (single-cycle access), a cache low-power sleep mode, on-chip RAM controls, a USB OTG peripheral, and 10/100 Ethernet MAC and UART peripherals.
The ARC 700 (and other cores) can also utilize any number of DSP extensions, including, e.g., dual 16x16 multiply-accumulate (MAC) instructions, dual 16x16 multiply-subtract (MSUB) instructions, 24x24 multiply-accumulate instructions, Viterbi and Turbo code butterfly instructions, Fast Fourier Transform (FFT) butterfly instructions, Cyclic Redundancy Check (CRC) instructions, and partial complex multiply instructions, to support a wide variety of different applications for the core.

The ARCtangent processor is a user-customizable 32-bit RISC core for ASIC, system-on-chip (SoC), and FPGA integration. It is synthesizable, configurable, and extendable, thus allowing developers to modify and extend the architecture to better suit specific applications. The ARCtangent microprocessor comprises a 32-bit RISC architecture with a four-stage execution pipeline. The instruction set, register file, condition codes, caches, buses, and other architectural features are user-configurable and extendable. It has a 32 x 32-bit core register file, which can be doubled if required by the application. Additionally, it is possible to use a large number of auxiliary registers (up to 2^32). The functional elements of the core of this processor include the arithmetic logic unit (ALU), register file (e.g., 32 x 32), program counter (PC), instruction fetch (i-fetch) interface logic, as well as various stage latches.

The ARC cores described above also utilize the ARCompact™ instruction set architecture (ISA) that allows designers to mix 16- and 32-bit instructions on a 32-bit processor. The key benefit of the ISA is the ability to cut memory requirements on a SoC (system-on-chip) by significant percentages, resulting in lower power consumption and lower cost devices in deeply embedded applications such as wireless communications and high-volume consumer electronics products.
The main features of the ARCompact ISA include 32-bit instructions aimed at providing better code density, a set of 16-bit instructions for the most commonly used operations, and freeform mixing of 16- and 32-bit instructions without a mode switch - significant because it reduces the complexity of compiler usage compared to competing mode-switching architectures. The ARCompact instruction set expands the number of custom extension instructions that users can add to the base-case processor instruction set. The existing processor architecture already allows users to add as many as 69 new instructions to speed up critical routines and algorithms. With the ARCompact ISA, users can add as many as 256 new instructions. Users can also add new core registers, auxiliary registers, and condition codes. The ARCompact ISA thus maintains and expands the user-customizable features of ARC's configurable processor technology. As 32-bit architectures become more widely used in deeply embedded systems, code density can have a direct impact on system cost. Typically, a very high percentage of the silicon area of a system-on-chip (SoC) is taken up by memory.
The ARCompact ISA delivers high density code helping to reduce the memory required for the embedded application, a vital factor for high-volume consumer applications, such as flash memory cards. In addition, by fitting code into a smaller memory area, the processor potentially has to make fewer memory accesses. This can cut power consumption and extend battery life for portable devices such as MP3 players, digital cameras and wireless handsets. Additionally, the new, shorter instructions can improve system throughput by executing in a single clock cycle some operations previously requiring two or more instructions. This can boost application performance without having to run the processor at higher clock frequencies.
The support for freeform use of 16 and 32-bit instructions allows compilers and programmers to use the most suitable instructions for a given task, without any need for specific code partitioning or system mode management. Direct replacement of 32-bit instructions with new 16-bit instructions provides an immediate code density benefit, which can be realized at an individual instruction level throughout the application. As the compiler is not required to restructure the code, greater scope for optimizations is provided, over a larger range of instructions. Application debugging is more intuitive because the newly generated code follows the structure of the original source code.
It is further noted that the present disclosure references a method of synthesizing a processor design having certain parameters ("build") incorporating, inter alia, the foregoing code compression functionality. The generalized method of synthesizing integrated circuits having a user-customized (i.e., "soft") instruction set is disclosed in Applicant's co-pending U.S. Patent Application Serial No. 09/418,663 entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design" filed October 14, 1999, which is incorporated herein by reference in its entirety, as embodied in the "ARChitect" design software manufactured by the Assignee hereof, although it will be recognized that other software environments and approaches may be utilized consistent with the present invention, including the largely object-oriented "ARChitect 2" software disclosed in co-owned and co-pending United States Patent Application Publication No. 20030229482 to Cook, et al., Serial No. 10/423,745 filed April 25, 2003 and published December 11, 2003 and entitled "Apparatus and method for managing integrated circuit designs", which is incorporated herein by reference in its entirety.
Hence, references to specific attributes of the aforementioned ARChitect program(s) are merely illustrative in nature.
Additionally, while the following description of the design environment is presented in terms of an algorithm or computer program running on a microcomputer or other similar processing device, it can be appreciated that other hardware environments (including minicomputers, workstations, networked computers, "supercomputers", and mainframes) may be used to practice the method. Further, one or more portions of the computer program may be embodied in hardware or firmware as opposed to software if desired, such alternate embodiments being well within the skill of the computer artisan.
Exemplary Code Compression Methods and Apparatus
Exemplary embodiments of the present invention are now described with respect to Figs. 2-6 herein. It will be recognized that these embodiments are merely illustrative of the broader principles of the invention, which include inter alia, non-traditional placement of the compressed/uncompressed code space boundary, and use of a dictionary-based code decompressor therein.
In contrast to the prior art, the ICache of the illustrated embodiment contains compressed instructions. This approach has the advantage of effectively increasing the size of the cache, but it also generates a number of new considerations not apparent in the prior art approaches previously described herein.
The compression algorithm chosen for the present embodiment was originally based on the instruction set architecture (ISA) of the well-known "x86" family of processors. This family has 8-bit, 16-bit, 24-bit, and 32-bit instructions. The 32-bit instructions are compressed into two different classes with corresponding lengths of 8 and 16 bits. After compilation and linking, the embedded program is parsed, and all the unique instructions are identified and sorted by their frequency distribution. The dictionary entries then become the most frequently encountered instructions, and the codewords are the actual indexes to the corresponding instructions. The first bit of a codeword flags whether the instruction is compressed, and the next two bits show which class it belongs to, that is, in which region of the dictionary it can be found. The remaining bits in a codeword point to the exact location of the instruction in the dictionary. In order to preserve the indigenous ISA of the exemplary platform of the present embodiment (i.e., the ARC 700, ARC 600, or ARCtangent cores previously described), a one-bit flag is added to each uncompressed instruction indicating that it is not compressed, although it will be recognized that other schemes (including more than one bit in order to specify one or more compression "modes") may be used if desired. The one-bit approach used herein has the advantage of simplicity. After compression, the exemplary program consists of words with three different lengths: 8 bits, 16 bits, and 33 bits. This means that instructions are not aligned on a word boundary in memory, so a technique for managing the memory is necessary.
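The codeword layout just described can be sketched as follows; the placement of the flag and class bits at the top of the codeword, and the treatment of the width as a parameter, are assumptions made for illustration.

```c
/* Hedged sketch of parsing an 8- or 16-bit codeword: a leading flag bit
 * (compressed or not), two class bits selecting the dictionary region,
 * and the remaining bits indexing into that region. Bit ordering is an
 * assumption; only the field meanings come from the text. */
#include <stdint.h>

typedef struct {
    unsigned compressed;  /* 1 = codeword, 0 = literal 32-bit instruction */
    unsigned cls;         /* 2-bit class: which dictionary region to use  */
    unsigned index;       /* remaining bits: position inside that region  */
} codeword_t;

codeword_t parse_codeword(uint16_t w, unsigned width_bits /* 8 or 16 */)
{
    codeword_t c;
    c.compressed = (w >> (width_bits - 1)) & 1;         /* leading flag  */
    c.cls        = (w >> (width_bits - 3)) & 3;         /* next two bits */
    c.index      = w & ((1u << (width_bits - 3)) - 1);  /* low bits      */
    return c;
}
```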
In the exemplary ARC ISA there are several instructions that cause a change of flow (COF). These are jump, jump and link, loop, branch, and branch and link instructions; all of these instructions can be conditional. For jump instructions where the condition is met, program execution is resumed at the location contained in a register. Branch and loop targets are calculated relative to the program counter (PC), and a signed multi-bit (e.g., 20-bit) field holds the displacement value. When a jump and link instruction is executed, the current status register value (which contains the PC) is placed in a special blink register, so that all function returns are jumps to the value in the blink register, thereby maintaining continuity.
To manage changes of flow during program execution, the code compressor and decompressor of the illustrated embodiment use several mechanisms. When the program is being compressed, the new address of each instruction, including its bit offset in the memory word, is calculated. As the processor ideally should be completely unaware of compression, the branch offsets are not changed. Instead, an exemplary 16-bit entry 200, whose format is shown in Fig. 2, is appended to each branch instruction.
The first 11 bits 202 of the exemplary entry 200 of Fig. 2 represent the offset difference between the compressed and uncompressed space. By adding the original offset to this difference, the real target offset can be calculated. The next 5 bits 204 show the offset of the compressed word in the memory word; i.e., at which bit position the target instruction word begins. Adding this entry after each branch instruction degrades the compression ratio on the order of 6% to 8% in the present embodiment. Anecdotal evaluation by the inventors hereof of a large range of programs reveals that every sixth instruction in the ARC code is a branch, although it will be appreciated that different degrees of degradation may be experienced based on any number of factors including the processor ISA chosen, program content, etc.
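The 6% to 8% figure can be sanity-checked from the branch density just cited: if roughly every sixth 32-bit instruction gains a 16-bit appendix, the added overhead is 16/(6 × 32) ≈ 0.083, or about 8% of the original code size. A minimal sketch of packing and unpacking the 16-bit entry 200 follows; the relative field positions are an assumption, as the specification fixes only the 11-bit and 5-bit widths.

    #include <stdint.h>

    /* Pack the 11-bit offset difference (field 202) and the 5-bit bit
     * position within the memory word (field 204) into one 16-bit entry. */
    uint16_t pack_entry(uint16_t offset_diff11, uint8_t bit_pos5)
    {
        return (uint16_t)(((offset_diff11 & 0x7FFu) << 5) | (bit_pos5 & 0x1Fu));
    }

    /* Unpack the entry; the real target is then the original branch
     * offset plus the recovered offset difference, as described above. */
    void unpack_entry(uint16_t entry, uint16_t *offset_diff11, uint8_t *bit_pos5)
    {
        *offset_diff11 = (entry >> 5) & 0x7FFu;
        *bit_pos5      = entry & 0x1Fu;
    }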
Another factor considered by the present invention is the effect of indirect jump instructions. Indirect jumps occur when the target is not known at compile time, but is instead computed at execution time. Indirect jumps are generated by the exemplary compiler in the following cases: (i) switch statements, which select one of several alternatives; (ii) function pointers, which allow functions to be passed as arguments; (iii) dynamically shared libraries, which allow a library to be loaded and linked at run time; and (iv) virtual functions or methods in object oriented languages such as C++ and Java.
In embedded systems, the last two cases (iii) and (iv) discussed above are rarely encountered, so the exemplary embodiment of the invention prioritizes the architecture such that the effects of compression on indirect jumps relating to switch statements and function pointers are considered. It will be recognized, however, that the architecture disclosed herein may be adjusted based on other priorities, such as where an embedded or other application is more intensive in terms of its use of dynamically shared libraries and/or object-oriented languages. For both switch statements (i) and function pointers (ii), a jump table is usually generated. For the purposes of the present embodiment, a custom version of the compiler can be utilized that generates a special section holding (a) the address of the jump table, (b) the number of entries in the table, and (c) the type of table. With this information, a look-up table is generated that holds the address of the indirect jump and all the possible targets. Advantageously, this table is not large, as the targets are usually few in number and indirect jumps typically form only 10% of all the jumps in the program. It will be recognized by those of ordinary skill provided with this disclosure, however, that other configurations of look-up tables, or even mechanisms other than jump or look-up tables, may be used consistent with the present invention. The returns from subroutines (these are jumps to the "blink" register previously described) are in the exemplary embodiment implemented in hardware, and more details are provided subsequently herein. However, other implementations are possible. So-called "test bench" programs (such as, for example, the "MediaBench" suite compiled for the ARC microprocessor, described in Mediabench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems, C. Lee, M. Potkonjak, and W.H. Mangione-Smith, International Symposium on Microarchitecture, pp. 330-335, 1997, which is incorporated herein by reference in its entirety) have been used to objectively assess the compression performance of the algorithm of the present invention. Fig. 3 summarizes the results of these tests.
As shown in Fig. 3, the compression ratios from the algorithm described herein are compared to those achieved by running the same data set on several different popular commercial compression tools. These other algorithms are not suitable for code compression: some of them do not allow random access (see the discussion of the prior art previously presented herein), while others cannot be implemented in hardware because the implementation would be prohibitively complex (e.g., PPMZ) or do not give good compression ratios for small block sizes (e.g., XMatchPro). The mean compression ratio achieved by the exemplary algorithm of the present invention is 0.63 using the MediaBench suite, when the branch appendices are taken into account.
The branch appendices degrade the compression ratio by roughly 0.08. It will be recognized by those of ordinary skill, however, that alternative approaches may be implemented consistent with the present invention to mitigate this penalty. Nevertheless, the compression ratios achieved by the exemplary embodiment of the class-based dictionary (CBD) technique remain comparable to those of other algorithms, and advantageously exceed the ratios achieved by the other proposed hardware code compression techniques described previously herein.
Exemplary configurations of the decompression hardware used with the CBD approach are now described with respect to Figs. 4 and 5. In the exemplary CBD embodiment, the hardware decompression logic is placed between the instruction cache (ICache) and the processor core. The decompression logic's main functions are (i) to decompress the compressed instructions at execution time, and (ii) to map the uncompressed memory space as seen by the processor to the compressed memory space. The main components of the architecture and its integration into the processor's pipeline are shown in the exemplary architecture 400 of Fig. 4. The run-time decompression process is now also described in detail. To fetch an instruction, the processor provides the decompressor 404 with the uncompressed memory address value. The addresses provided by the processor's requests are ignored, as the decompressor has its own built-in Compressed Program Counter (CPC) within the CPC unit 406 that determines the compressed addresses. The exception is the case of indirect jumps, in which the address provided by the processor is used to look up the corresponding compressed-space address in a table. The decompressor 404 sends requests to the ICache 408 to load instructions independently of the processor whenever its 66-bit load buffer 410 is empty (see the discussion provided subsequently herein). When a change of the execution flow occurs, the decompression core is restarted: two 32-bit words are loaded into the buffer 410, the decompression process restarts, and these and subsequent instructions are decompressed and provided to the processor core. Note that in the illustrated embodiment, a stall in the processor pipeline causes a stall in the decompression engine 404 and vice versa. However, it will be appreciated that a selective or parallel decompression arrangement may be used, such as where the decompressor 404 (including the CPC logic 406) is disposed in a path parallel to the pipeline and selectively utilized only for a certain fraction of the instructions within the instruction set.
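The restart behavior just described can be modeled in software as follows (a behavioral sketch only; the structure, field names, and word-aligned refill are assumptions consistent with the 66-bit buffer and two-word reload described above):

    #include <stdint.h>
    #include <string.h>

    struct decomp {
        uint8_t  buf[9];      /* models the 66-bit load buffer 410        */
        unsigned valid_bits;  /* number of valid bits currently held      */
        uint32_t cpc;         /* compressed program counter (bit address) */
    };

    /* Restart after a change of flow: discard the buffer contents, point
     * the CPC at the compressed target, and reload two 32-bit words from
     * the ICache (the two-cycle refill noted above).  Decoding then
     * resumes at the bit offset given by the low bits of the CPC. */
    void restart(struct decomp *d, uint32_t compressed_target,
                 uint32_t (*icache_read)(uint32_t word_addr))
    {
        d->valid_bits = 0;
        d->cpc = compressed_target;
        for (int i = 0; i < 2; i++) {
            uint32_t w = icache_read((d->cpc >> 5) + (uint32_t)i);
            memcpy(&d->buf[d->valid_bits / 8], &w, sizeof w);
            d->valid_bits += 32;
        }
    }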
The exemplary decompression core uses the standard ARC ICache interface, although it will be apparent that other interface configurations may be substituted consistent with the given application. The B_TAKEN signal of the aforementioned cache interface is modified such that it goes "high" when a change of flow (COF) instruction is executed; it is driven from the decode unit (second stage) of the processor pipeline.
It is also noted that the exemplary decompressor architecture 400 of Fig. 4 adds two (2) stages to the processor's pipeline; i.e., the "unpack" stage 430 and the "decompress decode" stage 431, disposed between the instruction fetch and decode stages of the typical processor. It will be appreciated, however, that the functionality of the present invention may be implemented in any different number of stages if desired, the term "stage" here being somewhat relative and arbitrary. For example, the functionality of the unpack and decompress/decode stages shown in Fig. 4 may be aggregated into one pipeline "stage", or alternatively disposed within three or more "stages". Furthermore, various of the functional units in each stage may conceivably be disposed in other stages or aggregated within other existing components or functional units within the processor (or processor "island" if present). Hence, many different design variations are possible consistent with the broader principles of the present invention.
The decompressor modules of the exemplary illustrated embodiment are now described in greater detail with respect to Figs. 4 and 5. Load Buffer - The load buffer 410 of the exemplary embodiment is a customized 66-bit register, which can be read from and written to at any bit position. Instructions are continually loaded into the buffer 410 whenever 32 or more bits are empty. After restart, it takes two cycles to fill the register before decompression can begin. As the decompressor of the illustrated embodiment is restarted at every COF in the program, these two additional cycles required for loading are the primary cause of degradation in performance. It will be appreciated, however, that pre-loading of the buffer 410 under certain circumstances by a buffer management unit (not shown) can be utilized if desired, thereby reducing the latency associated with the aforementioned load cycle, at the penalty of increased complexity and silicon. Decoding and Load Control Unit (DCLU) - As shown in the method 450 of Fig. 4a, if the load register or buffer 410 is full (i.e., 33 or more of its bits are valid) per step 452, the relevant bit(s) (here, the first valid bit) are decoded per step 454 to detect whether the instruction is compressed or not. As stated above, the compression coding may utilize more than one bit, such as where two bits are utilized to specify up to four (4) distinct compression modes. As yet another alternative, other bits in a given instruction can be used as "hint" bits for one or more subsequent instructions, such as where it is desirable to let the DCLU know in advance of the compression status of a follow-on instruction coming from the cache.
If the instruction is uncompressed, the following 32 bits of the buffer are latched to the pipeline register per step 456. If the instruction is compressed, another set of bits (here, two) is decoded per step 458 to determine the length of the codeword. The codeword is sent to the dictionary (step 460), and the DCLU 414 sends a signal to the CPC logic 406 to indicate the number of bits by which the CPC should be incremented in order to fetch the next instruction (step 462).
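In software-model form, the DCLU decision sequence of Fig. 4a might look as follows (bit layouts as assumed in the earlier encoding sketch; dict8 and dict16 stand in for the two dictionary regions):

    #include <stdint.h>

    /* Read n (<= 33) bits MSB-first from 'buf' starting at bit index 'pos'. */
    static uint64_t get_bits(const uint8_t *buf, unsigned pos, unsigned n)
    {
        uint64_t v = 0;
        for (unsigned i = 0; i < n; i++)
            v = (v << 1) | ((buf[(pos + i) >> 3] >> (7 - ((pos + i) & 7))) & 1u);
        return v;
    }

    /* Decode one instruction starting at bit 'pos'; the return value is
     * the number of bits consumed (8, 16 or 33), i.e. the amount by which
     * the CPC should be incremented per step 462. */
    unsigned decode_one(const uint8_t *buf, unsigned pos,
                        const uint32_t *dict8, const uint32_t *dict16,
                        uint32_t *insn_out)
    {
        if (get_bits(buf, pos, 1) == 0) {                     /* uncompressed */
            *insn_out = (uint32_t)get_bits(buf, pos + 1, 32); /* step 456 */
            return 33;
        }
        if (get_bits(buf, pos + 1, 2) == 0) {                 /* step 458 */
            *insn_out = dict8[get_bits(buf, pos + 3, 5)];     /* 8-bit codeword */
            return 8;
        }
        *insn_out = dict16[get_bits(buf, pos + 3, 13)];       /* 16-bit codeword */
        return 16;
    }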
Dictionary - The dictionary unit 416 can be implemented in ROM or RAM (e.g., SRAM), and may even reside off-chip if desired. It stores the instructions that are being compressed, and the codewords are used as indexes to its entries. The size of the dictionary will vary according to the requirements of different programs. Larger dictionaries generally give better compression, but there is also a trade-off between the compression ratio and the complexity of the on-chip circuitry. For the exemplary test bench programs previously described herein (see Fig. 3), a dictionary of 8 kilobytes was suitable, although it will be appreciated that other values (more or less) may be used consistent with the intended application. It is also recognized that optimization may be conducted between dictionary size and the complexity/size/power consumption of the dictionary unit logic in order to determine the optimal trade-off values for the given application, such optimization techniques being readily implemented by those of ordinary skill in the processor design arts provided with the present disclosure. Additionally, the size of the dictionary may be dynamically varied as a function of time or another variable (such as program type). For example, one type of program used may require a greater dictionary size in order to provide an acceptable level of compression, whereas a second type may require less. The program type may be determined at load, for example, through reading of a designated register or other mechanism which causes the decompressor and associated logic (not shown) to dynamically select and allocate space within the dictionary unit 416 based on a pre-stored lookup table or other mechanism which correlates program type and dictionary size (a hypothetical example of such a table is sketched below, following the Decode Unit description). Fig. 4b shows an exemplary method 480 of utilizing a variable such as program type to generate a lookup table for optimization parameters. Decode Unit - At the decompression decode pipeline stage 431, all of the instructions fetched to the processor are partially decoded by the decode unit 412 of the decompressor. Decoding is necessary to detect changes of execution flow at an early stage. When the decoding of an opcode indicates that a branch instruction has been encountered, all or a portion of the offset field (e.g., 20 bits) is sent to the CPC logic block 406, which uses it to calculate the offset in compressed memory. The decode unit 412 also signals the load control unit 414 that a branch is being encountered, which then sends the following 16 bits (the branch offset entry; see Fig. 2) to the CPC 406. In the case of jumps, the operand is also decoded in order to discern the type of jump. Fig. 4c illustrates this process graphically.
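Returning to the dictionary sizing discussed above, a pre-stored table correlating program type to dictionary allocation might take the following form (the program types, sizes, and names are purely hypothetical):

    #include <stddef.h>

    struct dict_policy { int program_type; size_t dict_bytes; };

    /* Hypothetical policy table consulted at program load, as described
     * above; entries correlate program type with allocated dictionary size. */
    static const struct dict_policy policy[] = {
        { 0, 4096 },   /* e.g., control-dominated code: smaller dictionary */
        { 1, 8192 },   /* e.g., media code: the 8-kilobyte figure above    */
    };

    size_t dictionary_size_for(int program_type)
    {
        for (size_t i = 0; i < sizeof policy / sizeof policy[0]; i++)
            if (policy[i].program_type == program_type)
                return policy[i].dict_bytes;
        return 8192;   /* default allocation when the type is unknown */
    }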
CPC Unit - The CPC unit 406 is the most complex block of the decompressor 400, and its internal structure is shown in detail in Fig. 5. Its functions include: (i) extraction of the next CPC address; (ii) calculation of the branch and loop targets; (iii) translation of the uncompressed indirect jump targets to compressed targets; (iv) keeping track of function return addresses in compressed space; and (v) predicting whether a branch will be taken or not.
The exemplary ARC core used as the basis of the illustrated embodiment of Fig. 5 has a 24-bit instruction address bus. The two least significant bits of the address reflect the byte position, and in the present embodiment they are used for compression purposes. Three more bits are added to the CPC in order to designate the exact location of the codewords in the compressed memory space, as these might start at any bit position. The CPC is, in this embodiment, a 27-bit register, whose contents are pipelined at each stage of decompression (CPC_1 530 and CPC_2 532). The CPC increment logic 504 consists of three counters, which increment the contents of the CPC register 540 by 8, 16 or 33, depending on the length of the codewords.
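Viewed another way, the 27-bit CPC is a bit-granular address: 24 bits locate the byte and 3 bits select the bit within it. A minimal software model of the increment logic 504 and of splitting the CPC for a cache request follows (the function names are illustrative only):

    #include <stdint.h>

    #define CPC_MASK 0x7FFFFFFu   /* 27-bit register */

    /* Advance the CPC by a codeword length of 8, 16 or 33 bits. */
    uint32_t cpc_increment(uint32_t cpc, unsigned codeword_bits)
    {
        return (cpc + codeword_bits) & CPC_MASK;
    }

    /* Split the CPC into the 32-bit-word address presented to the ICache
     * and the bit offset of the codeword within that word. */
    void cpc_split(uint32_t cpc, uint32_t *word_addr, unsigned *bit_offset)
    {
        *word_addr  = cpc >> 5;    /* 32 bits per memory word */
        *bit_offset = cpc & 0x1Fu;
    }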
When a branch is detected by the decode unit 412, the contents of register CPC_1 530 (the address of the instruction following the branch instruction) are added to the 16-bit offset entry supplied by the DCLU 414. The calculated branch target address is one of the inputs of the branch multiplexer (B_MUX) 502 (Fig. 5), the other input being the incremented CPC from the CPC increment logic 504. A simple static scheme is used in the illustrated embodiment for branch prediction; this scheme is based on experiments performed by the inventors hereof on the MediaBench suite, namely that 60-75% of the negative branches in the code are taken. If the offset of the branch instruction is negative, the prediction is that the branch will be taken (based on the foregoing statistics), and so the load buffer and the pipeline registers are restarted. The correction signal [B_TAKEN] 421 arrives one cycle later from the decode unit 422 of the processor. If the prediction is wrong, the misprediction multiplexer (M_MUX) output signal 506 resolves to the incremented CPC. It will be appreciated, however, that branch prediction schemes other than the static scheme described above may be used consistent with the invention. For example, more sophisticated or non-static schemes may be applied, such as where more than just the (negative) offset of the branch instruction is considered as the basis for prediction. See also the discussion below regarding the Smith branch predictor. The CAM (Content Addressable Memory) jump table 508 maps the indirect jumps' target addresses in the uncompressed space, as requested by the processor, to the corresponding target addresses in compressed space. When an indirect jump occurs, the CAM 508 is searched and the correct address is fetched to the J_ADDR multiplexer 510. J_ADDR's other input is fed from the Subroutine Return Stack (SRS) 512. The SRS 512 is, in the present embodiment, an SRAM stack-like structure that stores the return addresses whenever a branch and link or a jump and link instruction is executed. The required depth of the SRS depends on the extent of function call nesting in the application being executed. It will be appreciated, however, that other depths and types of structures (SRAM, ROM, or otherwise) may be used to provide this functionality consistent with the invention.
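The prediction rule, the SRS, and the CAM jump table can each be modeled in a few lines (the depth and table contents are assumptions; the hardware structures described above are not limited to these forms):

    #include <stdint.h>

    /* Static scheme described above: backward (negative-offset) branches
     * are predicted taken; forward branches are predicted not taken. */
    int predict_taken(int32_t branch_offset) { return branch_offset < 0; }

    /* Software model of the Subroutine Return Stack (SRS) 512. */
    #define SRS_DEPTH 16           /* assumed depth; tracks call nesting */
    static uint32_t srs[SRS_DEPTH];
    static int srs_top;

    void srs_push(uint32_t ret_addr)   /* on branch-and-link / jump-and-link */
    {
        if (srs_top < SRS_DEPTH) srs[srs_top++] = ret_addr;
    }

    uint32_t srs_pop(void)             /* on a jump to the blink register */
    {
        return srs_top > 0 ? srs[--srs_top] : 0;
    }

    /* Software model of the CAM jump table 508: translate an uncompressed
     * indirect-jump target to its compressed-space equivalent. */
    struct cam_entry { uint32_t uncompressed, compressed; };

    uint32_t cam_lookup(const struct cam_entry *t, int n, uint32_t addr)
    {
        for (int i = 0; i < n; i++)
            if (t[i].uncompressed == addr) return t[i].compressed;
        return 0;   /* miss; should not occur for a complete table */
    }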
As discussed above, the decompression engine is placed between the ICache and the processor. The instruction cache holds compressed instructions, which means that the resulting system will exhibit a lower cache-miss ratio, giving the present approach a significant performance advantage compared with other code compression schemes found in the prior art. The present invention accordingly provides clear execution speed advantages. A further significant advantage of the present approach is that the functionality of the processor is unaffected and no modifications need to be made to its internal organization. This is a highly attractive feature, since oftentimes such modifications will "ripple" through the various stages of the design process and architecture.
The present invention further recognizes that other configurations and alterations to the exemplary compression/decompression architectures described herein may be made. For example, a dynamic 2-bit Smith branch predictor of the type known in the prior art (or even other types of prediction schemes) may be used to enhance branch prediction performance. Additionally, techniques for discarding the additional 16-bit branch offset entries previously described may be used. These concepts have been proven by the inventors using software simulations.
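For reference, a minimal model of such a 2-bit saturating-counter (Smith) predictor, which such a variant would maintain per branch or per table entry:

    #include <stdint.h>

    /* Counter states: 0,1 predict not taken; 2,3 predict taken. */
    typedef uint8_t smith2_t;

    int smith2_predict(smith2_t c) { return c >= 2; }

    /* Update after the branch resolves, saturating at the extremes so a
     * single anomalous outcome does not immediately flip the prediction. */
    smith2_t smith2_update(smith2_t c, int taken)
    {
        if (taken) return (smith2_t)(c < 3 ? c + 1 : 3);
        return (smith2_t)(c > 0 ? c - 1 : 0);
    }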
Verification of the final hardware configuration of any processor design utilizing the present invention may also be performed using any number of commercially available suites, such as for example the ARC ARCangel™ prototyping system manufactured by the Assignee hereof. The mean compression ratio achieved for a standard set of embedded system programs is on the order of 0.6 using an 8 kilobyte dictionary, which compares quite favorably with other prior art hardware compression schemes.
Integrated Circuit (IC) Device
As previously described, the Assignee's processor core configurations are used as the basis for the IC device of the exemplary embodiments described herein; however, other arrangements and configurations may be substituted if desired. The device is fabricated from the customized VHDL or Verilog design obtained using the method referenced subsequently herein, which is then synthesized into a logic level representation, and then reduced to a physical device using compilation, layout and fabrication techniques well known in the semiconductor arts. For example, the present invention is compatible with 0.35, 0.18, 0.13 and 0.1 micron processes, and ultimately may be applied to processes of even smaller or other resolution. An exemplary process for fabrication of the device is the 0.1 micron "Blue Logic" Cu-11 process offered by International Business Machines Corporation, although others may be used.
It will be appreciated by one skilled in the art that the IC device of the present invention may also contain any commonly available peripheral such as serial communications devices, parallel ports, timers, counters, high current drivers, USB interfaces, Ethernet 10/100/1000 interfaces, analog to digital (A/D) converters, digital to analog (D/A) converters, interrupt processors, LCD drivers, memories and other similar devices. Further, the processor may also include other custom or application specific circuitry, such as to form a system on a chip (SoC) device useful for providing a number of different functionalities in a single package as previously referenced herein. The present invention is not limited to the type, number or complexity of peripherals and other circuitry that may be combined using the method and apparatus. Rather, any limitations are primarily imposed by the physical capacity of the extant semiconductor processes, which improve over time. Therefore it is anticipated that the complexity and degree of integration possible employing the present invention will further increase as semiconductor processes improve.
It will be further recognized that any number of methodologies for synthesizing logic incorporating the "code compression" functionality previously discussed may be utilized in fabricating the IC device. Exemplary methods of synthesizing integrated circuit logic having a user-customized (i.e., "soft") instruction set are disclosed in co-pending U.S. Patent Application Serial Nos. 09/418,663 and 10/423,745 previously referenced herein. Other methodologies, whether "soft" or otherwise, may be used, however. Concurrent generation of software and hardware design tools (such as compilers, assemblers, simulators, verification tools, etc.) may also be performed based on the processor configuration specification, such techniques being well known to those of ordinary skill.
Fig. 6 illustrates an exemplary methodology of generating a processor design having code compression as described herein using one of the aforementioned "soft" tools. As shown in Fig. 6, the method 600 generally comprises first providing a software design environment (such as for example one of the aforementioned ARChitect tools) capable of generating such a design (step 602). Next, a plurality of design options are presented to the designer (step 604), including basic processor configuration (e.g., Harvard or Von Neumann), cache configuration, peripherals, register sets, etc. Next, per step 606, extension hardware is selected by the designer. As part of this step (or a separate step), the designer is also afforded the ability to include a "compression module" which includes the compression hardware shown in Figs. 4 and 5 herein. Election of this option structures the processor pipeline to, inter alia, include the two compression/decode stages 430, 431 of Fig. 4, and configures the compression dictionary size as previously described. As part of this process, an optimization routine may be employed, wherein the compression dictionary size (and any other relevant parameters associated with the configuration of the decompressor and associated hardware) are selected based on the output of this optimization routine; a sketch of one such routine is provided below. For example, the user may be provided the opportunity to select the desired die size, CPU speed, power limitations, etc. as part of the design process. The aforementioned routine may then iterate through various different compression dictionary sizes based on the user-selected optimization parameters in order to provide an "optimal" dictionary size. Various other approaches recognized by those of ordinary skill may be used as well, including allowing the user to select a trial dictionary size which, along with the user's other selections and an optimization routine or test bench, allows generation of an estimate of the salient core performance parameters. Furthermore, the aforementioned optimization process may be made global and/or permeate other stages of the design process in order to minimize possible variations. For example, in one variant, all user selections are obtained by the design software before any optimization is conducted, which allows the optimization routine or test bench to consider all aspects of the design including, e.g., other extension hardware and instructions. The routine can then iterate (manually or automatically) on various dictionary sizes for the same set of user selections in order to generate various performance estimates. It will also be recognized, however, that the use of a compression module in the processor design can be made part of the "basecase" processor configuration (i.e., the non-extended portion) such that the compression module election of step 606 is obviated. In this fashion, the user may have anything ranging from (i) no choice in the matter, e.g., the use and parameters (including dictionary size) of the decompressor are pre-selected for the user either in static fashion or based on other parameters of the design; to (ii) complete control over the configuration of the decompressor and logic (yet no choice as to whether it is included in the first place).
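As referenced above, the following sketch illustrates one form such an optimization routine might take (the candidate list, area budget, and evaluation callback are placeholders; a real design tool would invoke its test bench in place of 'evaluate'):

    #include <stddef.h>

    /* Iterate candidate dictionary sizes and return the one yielding the
     * best (lowest) compression ratio within the user's area budget. */
    size_t pick_dictionary_size(const size_t *candidates, int n,
                                size_t max_area_bytes,
                                double (*evaluate)(size_t dict_bytes))
    {
        size_t best = 0;
        double best_ratio = 2.0;           /* worse than any real ratio */
        for (int i = 0; i < n; i++) {
            if (candidates[i] > max_area_bytes)
                continue;                  /* violates the area budget  */
            double r = evaluate(candidates[i]);
            if (r < best_ratio) { best_ratio = r; best = candidates[i]; }
        }
        return best;
    }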
Next, in step 608, one or more extension instructions are added, such as for custom DSP applications (e.g., Turbo coding, Viterbi decode, FFT, etc.), and may even include customized instructions generated by the users themselves. As will be appreciated, this step 608 may be integrated with step 606 described above, or even permuted in order therewith.
Lastly, after all selections are made, the design is evaluated for performance and any other relevant considerations (which may include compression ratio and decompressor dictionary size, etc. as previously discussed) per step 610. Any desired changes are then made (step 612), and the process iterated as required until convergence on the desired design is achieved.
It will be appreciated that while certain aspects of the invention have been described in terms of a specific sequence of steps of a method, these descriptions are only illustrative of the broader methods of the invention, and may be modified as required by the particular application. Certain steps may be rendered unnecessary or optional under certain circumstances. Additionally, certain steps or functionality may be added to the disclosed embodiments, or the order of performance of two or more steps permuted. All such variations are considered to be encompassed within the invention disclosed and claimed herein. While the above detailed description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the invention. The foregoing description is of the best mode presently contemplated of carrying out the invention. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the invention. The scope of the invention should be determined with reference to the claims.

Claims

WE CLAIM:
1. Pipelined data processor apparatus having an instruction cache and processor core, wherein said instruction cache contains at least one compressed instruction and at least one non-compressed instruction, said at least one compressed instruction being decompressed utilizing a decompression apparatus disposed within said pipeline between said cache and a decode stage of said pipeline.
2. The processor apparatus of Claim 1, wherein said decompression apparatus comprises a compressed program counter (CPC).
3. The processor apparatus of Claim 2, wherein said decompression apparatus further comprises a dictionary having a plurality of entries, at least a portion of said entries comprising compressed instructions.
4. The processor apparatus of Claim 3, wherein said dictionary entries are indexed by one or more codewords.
5. The processor apparatus of Claim 1, wherein said decompression apparatus comprises decoding apparatus adapted to identify whether a fetched instruction is in compressed form or not.
6. The processor apparatus of Claim 5, wherein said at least one non-compressed instruction comprises at least one bit, said at least one bit determining the compression status thereof.
7. The processor apparatus of Claim 5, wherein said at least one compressed instruction includes at least one codeword length bit, and said decoding apparatus is further adapted to: decode said at least one codeword length bit; and notify a program counter associated with said decoding apparatus of at least one of (i) the codeword length, and (ii) the next fetch location.
8. The processor apparatus of Claim 1, wherein said at least one compressed instruction comprises a user-configured or user-selected extension instruction.
9. The processor apparatus of Claim 1, wherein the compression and decompression of said at least one compressed instruction is completely transparent to the downstream stages of said pipeline.
10. User-configured RISC processor apparatus comprising: an instruction cache; a processor core with a multi-stage pipeline comprising at least compression and decompression decode stages disposed between the instruction fetch and instruction decode stages thereof; and an instruction set comprising at least one user-defined extension instruction; wherein said instruction cache contains at least one compressed version of said at least one user-defined instruction, said compressed version being generated using at least a class-based dictionary (CBD) model.
11. The processor apparatus of Claim 10, wherein at least one of said compression and decompression decode stages includes a codeword indexed dictionary having a plurality of entries.
12. The processor apparatus of Claim 11, wherein at least one of said compression and decompression decode stages further includes a compression program counter.
13. The processor apparatus of Claim 12, wherein at least one of said compression and decompression decode stages further includes apparatus adapted to determine whether any instruction received thereby is compressed and, if compressed, selectively initiate further decoding of said any instruction.
14. A method of operating a digital processor, comprising: providing a plurality of instructions in an uncompressed form; selectively compressing ones of said plurality of instructions; fetching ones of said compressed and uncompressed instructions; evaluating at least a portion of at least said uncompressed instructions to identify said instructions as uncompressed; for said compressed instructions that are fetched, decoding a length codeword therein; and based at least in part on said decoded length codeword, fetching another instruction.
15. The method of Claim 14, wherein said act of fetching comprises fetching at least a portion of said uncompressed and compressed instructions from an instruction cache.
16. The method of Claim 14, further comprising passing said decoded codeword to a dictionary, said codeword being used to index at least a portion of said dictionary.
17. The method of Claim 14, wherein said act of fetching another instruction based at least in part on said length codeword comprises notifying a program counter (PC) of an increment to be used in said fetching.
18. A method of decoding instructions within a digital processor, comprising: fetching at least one instruction; performing at least a partial decode of said at least one instruction; determining, based at least in part on said at least partial decode, whether said at least one instruction belongs to a first class; and if said at least one instruction does belong to said first class, sending at least offset information disposed within said at least one instruction to a program counter.
19. The method of Claim 18, wherein said act of performing at least a partial decode comprises performing said at least partial decode within a decompressor stage disposed within a pipeline of said processor.
20. The method of Claim 18, wherein said act of determining comprises determining if said at least one instruction comprises a branch instruction.
21. The method of Claim 18, wherein said program counter comprises a program counter other than the main program counter, said method further comprising: receiving said offset information at said other program counter; and using at least said offset information to determine a location from where a subsequent instruction fetch is to begin.
22. A method of decoding instructions within a digital processor, comprising: fetching at least one instruction; performing at least a partial decode of said at least one instruction; determining, based at least in part on said at least partial decode, whether said at least one instruction belongs to a first class; and if said at least one instruction does not belong to said first class, determining, based at least in part on said at least partial decode, whether said at least one instruction belongs to a second class; if said at least one instruction belongs to said second class, decoding at least one operand within said at least one instruction; and sending at least offset information disposed within said at least one instruction to a program counter.
23. The method of Claim 22, further comprising using said at least one decoded operand to determine the specific type of instruction within said second class that said at least one instruction comprises.
24. The method of Claim 23, wherein said act of determining whether said at least one instruction belongs to a second class comprises determining whether said instruction is a jump instruction.
25. The method of Claim 24, wherein said act of performing at least a partial decode comprises performing said at least partial decode within a decompressor stage disposed within a pipeline of said processor, said decompressor stage being disposed substantially between an instruction cache and the remainder of said pipeline.