HK1174748A

HK1174748A - System, apparatus, and method for adaptive weighted interference cancellation using parallel residue compensation

Info

Publication number: HK1174748A
Application number: HK13101529.3A
Authority: HK
Inventors: Yuanbin Guo; Dennis Mccain; Joseph R. Cavallaro
Original assignee: Core Wireless Licensing S.A.R.L.
Priority date: 2005-02-25
Filing date: 2013-02-04
Publication date: 2013-06-14

Description

System, apparatus and method for adaptive weighted interference cancellation using parallel residual compensation

The present application is a divisional application of Chinese patent application 200680006029.4, entitled "System, apparatus and method for adaptive weighted interference cancellation using parallel residual compensation", filed on February 20, 2006.

Technical Field

The present invention relates generally to multiple access communication systems, and more particularly to systems, apparatuses, and methods for enhancing suppression of multiple access interference.

Background

Generally speaking, a cellular communication system provides communication channels to multiple users simultaneously within a given service area (e.g., a cell). Such communication channels include an uplink (mobile terminal to base station) and a downlink (base station to mobile terminal) for two-way multiple access communication with many users. Regardless of which multiple access scheme is employed, however, the number of users that can be served in a given cell is bounded above. For example, the number of users that can be accommodated by each cell in a Time Division Multiple Access (TDMA) system is limited by the number of time slots, M, that are available in the uplink and downlink frequency bands. These bands may be represented as a continuous time-frequency plane in which M slots are available. Accordingly, the number of mobile terminals capable of simultaneous communication with their respective base stations is equal to M, with the mth user transmitting signal energy in the mth time slot over a low-duty-cycle uplink. Reception from the base station to the mobile terminal is similarly limited in the downlink.

On the other hand, in Code Division Multiple Access (CDMA) systems, the signal energy is continuously distributed over the entire time-frequency plane, and each user shares the entire plane by employing wideband coded signaling waveforms. Thus, the number of users that can be simultaneously accommodated in a CDMA system is not limited by the number of time slots available in the time-frequency plane, but is a function of the number of users present in the communication channel and the amount of Processing Gain (PG) employed by the CDMA system. The PG of a CDMA system is defined as the ratio of the bandwidth of the spread signal in hertz (Hz) to the bandwidth of the data signal in Hz.

The number of users transmitting within a given CDMA channel contributes to the total amount of undesired signal power received and is thus a measure of the interfering signal power caused by multiple access users within the CDMA channel. Thus, from the PG and interfering signal power present at the CDMA receiver, an upper limit on the number of users that can be supported by a given CDMA channel can be calculated.

For example, if the information bandwidth of the transmitted data signal is 9600 Hz and the transmission bandwidth of the data signal is 1.152 megahertz (MHz), then PG = 1152000/9600 = 120, or 20.8 decibels (dB). Furthermore, if the bit energy-to-noise spectral density ratio (Eb/N0) required for acceptable performance of a CDMA communication system is equal to 6 dB, the communication device can accomplish its objective even when the interfering signal power exceeds the desired signal power by as much as 14.8 dB. That is, the allowed interference margin for the receiver is calculated to be 20.8 - 6 = 14.8 dB. Thus, if each user in the spread spectrum bandwidth delivers the same amount of signal power to the base station antenna through an ideal power control scheme regardless of location, then 10^2.08 ≈ 120 Multiple Access (MA) users can be accommodated by the CDMA channel.
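The processing gain and interference margin arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; the function and variable names are made up for the example and do not come from the specification.

```python
import math

def processing_gain_db(spread_bw_hz: float, data_bw_hz: float) -> float:
    """Processing gain in dB: ratio of spread bandwidth to data bandwidth."""
    return 10.0 * math.log10(spread_bw_hz / data_bw_hz)

# Figures taken from the example in the text.
pg_linear = 1_152_000 / 9_600                 # linear processing gain
pg_db = processing_gain_db(1_152_000, 9_600)  # the same gain in dB
required_ebn0_db = 6.0                        # Eb/N0 needed for acceptable performance
margin_db = pg_db - required_ebn0_db          # allowed interference margin

print(f"PG = {pg_linear:.0f} ({pg_db:.1f} dB), margin = {margin_db:.1f} dB")
```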

The idea of a CDMA communication system is therefore to use up the interference margin by accommodating as many co-channel communication devices as possible. As described above, these co-channel communication devices occupy the time-frequency plane simultaneously and thereby account for the interference power seen at the CDMA receiver. In theory, if their respective signals are orthogonal to each other, it is possible to reduce the Multiple Access Interference (MAI) caused by MA users in a CDMA channel to zero. In practice, however, co-channel interference, or cross-correlation from other codes, still exists because delayed and attenuated replicas of the signals arriving asynchronously are not orthogonal to their original components. Similarly, signals received from neighboring cells contribute to MAI because those signals are not synchronous and thus are not orthogonal to the signals received from the local cell.

Conventional CDMA receivers demodulate each user's signal as if it were the only signal present by using a bank of filters that match the user's signal waveform. Since the user's signal also contains cross-correlations, i.e., interference, from other codes, the matched filter gradually exhibits poor performance as the number of users increases, or as the relative power of the interfering signal becomes greater. Thus, the receiver is required to be able to determine which of the N possible messages is the message transmitted in the presence of this interference.

It is known that a Maximum Likelihood (ML) sequence detector based on the maximum a posteriori probability (MAP) receiver principle is the preferred receiver for making such a decision in the presence of interference. However, the complexity of the ML sequence detector grows exponentially with the number of codes being processed, which poses a prohibitive challenge for computation and memory implementation.

Prior art attempts to achieve a good balance between performance and complexity have produced a great deal of multi-user detection (MUD) research activity. Among these, the multi-stage Parallel Interference Cancellation (PIC) technique offers a promising algorithm for real-time implementation because of its relatively low computational complexity and good performance. In particular, the complete (full) PIC and partial PIC algorithms have attracted attention in the literature.

Full PIC is a subtractive interference cancellation scheme that assumes that the symbol detection from the previous stage is correct. An estimate of the MAI is then formed from the previous-stage detection and subtracted in full from the received signal. If some symbol detections are erroneous, for example when the system load is high or during the earlier iteration stages, the resulting interference estimates are erroneous as well, and subtracting them from the received signal may introduce more interference than was previously present. This phenomenon leads to the so-called "ping-pong" effect in conventional full PIC schemes.

In this case, it is not desirable to cancel the entire estimated interference. Instead, partial cancellation of the MAI, i.e., partial PIC, may be performed by introducing a weight in each stage. The weights are found by trial and error under the constraint that each weight takes a value between 0 and 1. Although partial PIC achieves a reasonably good performance improvement over the full PIC algorithm, the choice of weights used in each stage is known to greatly affect performance. An incorrect selection of weights can therefore fail to deliver acceptable performance at all.

Despite the continued development of MAI reduction technologies, few research activities have investigated the feasibility of Very Large Scale Integration (VLSI) implementations of these technologies. Although full PIC and partial PIC algorithms provide good performance and have low computational complexity, their real-time hardware implementation remains extremely challenging. Commercialization of these algorithms relies in particular on finding a viable VLSI architecture that can efficiently apply hardware resources, thereby achieving low power and low cost in its design.

Therefore, there is a need in the communications industry for an MAI reduction algorithm that further reduces computational complexity over the prior art. Furthermore, by taking advantage of the inherent properties of the MAI reduction algorithm, the reduced computational complexity will facilitate its VLSI implementation. The present invention fulfills these and other needs and provides other advantages over prior art MAI reduction methods.

Disclosure of Invention

To overcome the deficiencies of the prior art described above, and to overcome other deficiencies that will become apparent upon reading and understanding the present specification, the present invention discloses a system, apparatus and method for a multi-stage, Parallel Residue Compensation (PRC) receiver to enhance MAI suppression. The present invention improves MAI estimation accuracy by using user-specific weights calculated by an adaptive Normalized Least Mean Square (NLMS) algorithm. In this approach, direct interference cancellation is avoided and a reduction in algorithm complexity is achieved by exploiting the common characteristics among multiple users and the characteristics of the MAI suppression algorithm itself.

According to one embodiment of the present invention, a multi-stage, Normalized Least Mean Square (NLMS) -based, Parallel Residue Compensation (PRC) receiver includes a matched filter stage coupled to receive a multi-user signal and adapted to provide each user with data symbols representing demodulated bit stream packets. The receiver further comprises: a signal reconstructor coupled to receive the data symbols and adapted to generate a modulated representation for each user's data symbols to produce a replica of the multi-user signal; an NLMS module coupled to receive a replica of the multiuser signal and adapted to compute a weighted estimate of the replica; and a Parallel Residual Compensation (PRC) module coupled to receive the weighted estimate of the replica and the multiuser signal and adapted to generate a common residual error signal from the weighted estimate of the replica and the multiuser signal. The common residual error signal is eventually subtracted from the data symbols of each user to cancel the interference associated with the data symbols of each user.

According to another embodiment of the present invention, a method of estimating symbols transmitted from a plurality of users in a multi-user communication system comprises: calculating a weighted estimate of the multi-user signal; generating a common residual signal by subtracting the weighted estimates of the multi-user signals from the multi-user signals; compensating the signal of each user by the residual signal so as to obtain an interference-eliminated signal for each user; and filtering the interference canceled signal for each user to obtain an estimate of the symbol transmitted by each user.

According to yet another embodiment of the present invention, a Code Division Multiple Access (CDMA) chipset is contemplated that includes a Normalized Least Mean Square (NLMS)-based Parallel Residue Compensation (PRC) receiver. The receiver comprises a signal reconstruction circuit coupled to receive the multiuser signal and adapted to provide data symbols representing the demodulated bit stream packets for each user and further adapted to generate a modulated representation for the data symbols for each user to produce a replica of the multiuser signal. The CDMA chipset-based receiver further includes an NLMS circuit coupled to receive a replica of the multi-user signal and adapted to accumulate first and second weighted signals generated as a result of a difference between the multi-user signal and a weighted replica of the multi-user signal, wherein the replica of the multi-user signal includes a first spreading code bitstream, and first and second data streams. The CDMA chipset-based receiver further includes a Parallel Residue Compensation (PRC) circuit coupled to receive the weighted replica of the multiuser signal and adapted to generate first and second error signals from the weighted replica of the multiuser signal. The first and second error signals are subtracted from the data symbols for each user to cancel interference associated with the data symbols for each user.

In accordance with yet another embodiment of the present invention, a method for implementing a Normalized Least Mean Square (NLMS) -based Parallel Residue Compensation (PRC) receiver for reducing multiple access interference for each user of a multi-user signal is contemplated. The method comprises the following steps: two parallel processing paths are established to operate on two groups of users, wherein each processing path is implemented in combinational logic for operating on each group of users in succession. The successive operations in each processing path include: estimating symbols for each user of a group of users, calculating weighted symbols for each user of the group of users, calculating weighted sum chip signals for each user of the group of users, generating a detected bit vector for each user from the weighted sum chip signals, generating a difference between each bit of the detected bit vector and the symbol estimate for each user, adding the difference to the weighted symbols for each user, and generating an interference canceled signal for each symbol once all bits of the detected bit vector have been processed.

These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and form a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to the accompanying descriptive matter, in which there are illustrated and described exemplary systems, apparatuses, and methods in accordance with the invention.

Drawings

The invention is described in connection with the embodiments illustrated below.

FIG. 1 illustrates an exemplary system diagram of a multi-user communication system;

FIG. 2 illustrates an exemplary system on a chip (SoC) architecture in accordance with the present invention;

FIG. 3 illustrates an exemplary area-constrained architecture for a modulator according to the present invention;

FIG. 4 illustrates an exemplary system level architecture for a multi-stage, Normalized Least Mean Square (NLMS) receiver in accordance with the present invention;

FIG. 5 illustrates an exemplary multi-user matched filter module according to the present invention;

FIG. 6 illustrates an exemplary loop structure for chip-based updating for each symbol in accordance with the present invention;

FIG. 7 illustrates an exemplary block diagram of a basic Sumsub-MUX-Unit (SMU) design block in accordance with the present invention;

FIG. 8 illustrates an exemplary block diagram of the parallel orientation of the basic SMU design blocks of FIG. 7;

FIG. 9 illustrates an exemplary SMU weighted symbol (SMUw) block diagram in accordance with the present invention; and

FIG. 10 illustrates an exemplary block diagram for a Weighted Sum Matched Filter (WSMF) and Residual Compensation (RC) in accordance with the present invention.

Detailed Description

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the patent and trademark office patent file or records, but otherwise reserves all copyright rights whatsoever.

In the following description of the various exemplary embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized, and structural and operational changes may be made without departing from the scope of the present invention.

In general, the present invention provides a new, multi-stage Parallel Residual Compensation (PRC) receiver architecture for enhanced suppression of Multiple Access Interference (MAI) in Code Division Multiple Access (CDMA) systems. The accuracy of the interference estimation is improved using a set of weights computed by an adaptive Normalized Least Mean Square (NLMS) algorithm. The algorithm achieves significant performance gains over conventional Parallel Interference Cancellation (PIC) algorithms that assume full or partial interference cancellation.

To reduce complexity, common features of multi-code processing are extracted and used to derive the structure of the PRC, avoiding direct interference cancellation. The derived PRC structure reduces interference cancellation from a complexity proportional to the square of the number of users to a complexity linearly related to the number of users.
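The reduction from quadratic to linear complexity can be illustrated with a short identity. The notation here is assumed for illustration only and is not taken from the specification: r(i) is the received chip sample, w_k(i) is the weight of user k, \Omega_k(i) is the reconstructed chip signal of user k, and e(i) is the residual common to all users.

```latex
\begin{aligned}
\tilde{r}_k(i) &= r(i) - \sum_{j \neq k} w_j(i)\,\Omega_j(i) \\
               &= \Bigl( r(i) - \sum_{j=1}^{K} w_j(i)\,\Omega_j(i) \Bigr) + w_k(i)\,\Omega_k(i) \\
               &= e(i) + w_k(i)\,\Omega_k(i)
\end{aligned}
```

Evaluating the first line separately for every user requires K-1 products per user, or K(K-1) per chip. In the last line the residual e(i) is formed once from K products, after which each user needs only one further addition.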

Furthermore, the present invention is directed to a scalable system-on-chip (SoC) VLSI architecture using simple Sumsub-MUX-Unit (SMU) combinational logic. The proposed architecture avoids the use of dedicated multipliers, which yields at least a tenfold improvement in hardware resource usage. An efficient Precision-C based high-level synthesis (HLS) design methodology is applied to implement these architectures in FPGA systems. Hardware efficiency is achieved by exploring multi-stage parallelism and pipelining, which yields substantial improvements over conventional designs.

In one embodiment in accordance with the principles of the present invention, the enhanced MAI suppression algorithm is implemented within an Application Specific Integrated Circuit (ASIC) that is further integrated within the physical layer (PHY) processing engine of each CDMA chipset. The implementation includes a pipelined architecture for the NLMS weight updates, the PRC, and a matched filter element. Furthermore, the present invention contemplates optimization of the logic elements to replace dedicated multipliers with SMU combinational logic. In an alternative embodiment, a Digital Signal Processor (DSP) may be used, as long as the appropriate level of parallelism and pipelining can be achieved for the real-time processing required by the time-critical blocks.

The application of the present invention can be focused on any cellular communication algorithm utilizing spread spectrum techniques in base stations and mobile terminals. Such communication systems include CDMA systems conforming to, for example, CDMA2000, Wideband CDMA (WCDMA), High Speed Downlink Packet Access (HSDPA) systems for WCDMA, and other high capacity, multiple access communication protocols.

Fig. 1 shows an exemplary system diagram of a multi-user communication system 100, where user 1 through user K represent K users of the CDMA uplink physical layer to a corresponding base station (not shown). Although the CDMA uplink is emphasized in fig. 1, those skilled in the art will appreciate that a corresponding downlink exists and is not shown. Users 1-K share a common single-path channel 116 whose noise is modeled as Additive White Gaussian Noise (AWGN) 114, so distinguishing one user from the next involves the use of orthogonal or nearly orthogonal codes to modulate the transmitted bits. The orthogonal codes, or so-called spreading sequences, of the spreading modules 108-112 perform the necessary modulation.

The channel encoders 102-106 provide error correction functionality to the multi-user communication system 100, whereby discrete-time input sequences are mapped to discrete-time output sequences exhibiting redundancy. This redundancy provides a noise-averaging property that makes the channel decoder 128 less vulnerable to channel effects such as noise, distortion, and fading.

The CDMA communication system 100 may employ any number of modulation schemes, although for illustrative purposes a Quadrature Phase Shift Keying (QPSK) modulation scheme within the spreading modules 108-112 is discussed. Using this modulation scheme, a set of binary bits is used to map the nth data symbol of the kth user at the transmitter to a constellation point. The symbol output at the modulator (not shown), with each constellation point equally probable, is expressed as:

In an AWGN channel, the complex baseband signal received at the receiver 130 for the ith chip of the nth symbol is represented as:

where each user's contribution is weighted by the complex channel amplitude and the transmit power of the kth user; c_k[i+(n-1)N] is the ith chip of the spreading code for the nth symbol of the kth user and takes values in {±1}; N is the spreading factor; K ∈ [1, N] is the number of active users; and z(i) is a sample of complex additive Gaussian noise with two-sided spectral density N₀/2.
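A minimal sketch of this signal model follows. The specific QPSK bit mapping, the use of the square root of the transmit power as the amplitude scaling, and all function and parameter names are assumptions made for the example, not details confirmed by the specification.

```python
import math
import random

def qpsk_symbol(b0: int, b1: int) -> complex:
    """Map two bits to a unit-energy QPSK point (assumed mapping: 0 -> +1, 1 -> -1)."""
    return complex(1 - 2 * b0, 1 - 2 * b1) / math.sqrt(2)

def received_chips(symbols, codes, amps, powers, noise_std, rng):
    """Chip samples for one symbol interval.

    Each chip is the sum over users of amp_k * sqrt(P_k) * s_k * c_k[i],
    plus a complex Gaussian noise sample of standard deviation noise_std
    per real dimension.
    """
    n_chips = len(codes[0])
    r = []
    for i in range(n_chips):
        chip = sum(a * math.sqrt(p) * s * c[i]
                   for a, p, s, c in zip(amps, powers, symbols, codes))
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        r.append(chip + noise)
    return r

# Two users, spreading factor 4, orthogonal +/-1 codes, mild noise.
rng = random.Random(0)
codes = [[1, 1, 1, 1], [1, -1, 1, -1]]
symbols = [qpsk_symbol(0, 0), qpsk_symbol(1, 1)]
r = received_chips(symbols, codes, [1, 1], [1.0, 1.0], 0.1, rng)
```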

By collecting the N chip samples of one symbol duration in a vector, the received vector can be expressed as:

The received signal may be despread 122 using the matched filter 118, and soft estimates of the symbols for the multiple users are generated as follows:

where the cross-correlation matrix of the spreading codes enters the matched filter output, and the superscript H denotes the Hermitian conjugate.
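To make the role of the cross-correlation matrix concrete, the following sketch despreads a noiseless two-user signal with deliberately non-orthogonal codes. It assumes real ±1 codes and unit channel gains; the function names are illustrative only.

```python
def cross_correlation(codes):
    """R[j][k] = sum over chips of c_j[i] * c_k[i], for real +/-1 spreading codes."""
    return [[sum(a * b for a, b in zip(cj, ck)) for ck in codes] for cj in codes]

def matched_filter(r, codes):
    """Despread: correlate the received chips with each user's code."""
    return [sum(c_i * r_i for c_i, r_i in zip(code, r)) for code in codes]

codes = [[1, 1, 1, 1], [1, -1, 1, 1]]   # deliberately non-orthogonal
symbols = [1 + 0j, -1 + 0j]             # noiseless, unit amplitudes
r = [sum(s * c[i] for s, c in zip(symbols, codes)) for i in range(4)]

R = cross_correlation(codes)            # off-diagonal entries are non-zero
soft = matched_filter(r, codes)         # soft symbol estimates with MAI
```

With orthogonal codes the matched filter would return N times each symbol (here 4 and -4). Because the off-diagonal entries of R are non-zero, each output is pulled toward the other user's symbol, which is the MAI the text describes.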

When the cross-correlation matrix is not the identity matrix, MAI occurs. The symbol estimate for the kth user is given by:

The matched filter output is then phase-corrected by channel estimation, using channel estimation block 132 and multiuser detector 126, and sent to multiuser channel decoder 128. At the decoder, the estimated bits are detected as:

where "./" denotes element-wise division. The elements of the vectors in (6) and (7) are given as follows:

The particular class of multi-user detectors used in the multi-user detector 126 implementation is based on Interference Cancellation (IC), in particular Parallel Interference Cancellation (PIC). The concept is to cancel the interference generated by all users other than the desired user, since PIC offers lower computational requirements and a hardware-friendly structure. Traditionally, an iterative multi-stage PIC approach is used, in which the input to a given stage is the set of estimated bits from the previous stage. Assuming that the bit estimates at stage (m-1) are the bits transmitted by each user, the interference estimate at stage m for each user is determined by excluding the reconstructed signal of that particular user.

However, as mentioned above, if the estimates of the previous stages are not accurate enough, the PIC algorithm may introduce more interference into the signal. Thus, to obtain more accurate interference cancellation, a set of partial weights is introduced in each stage according to the invention. A respective weight is selected for each user depending on the accuracy of that user's symbol estimate. A cost function is defined as the squared Euclidean distance between the received signal r(i) and the weighted sum of the estimated signals of all users, and the optimal weights are obtained by minimizing the Mean Square Error (MSE) of this cost function:

where the weighted sum of the hard-decision symbols of all users at the m-th stage is given by:

Here, the weighted sum is formed from the weight vector of the m-th stage and the output vector of the multiuser spreader used in the PIC reconstruction.

The residual error in the mth stage is defined as the difference between the desired response and its estimate. The MMSE optimization of equation (9) is solved by the iterative update equation of the Normalized Least Mean Square (NLMS) algorithm, which operates at the chip rate within each bit interval:

where μ is the step size and the spreader output vector serves as the input vector to the NLMS algorithm. In the adaptive PIC, the interference for each of the K users is estimated in direct form as

An interference-cancelled chip-level signal is then generated for each user as

and the symbol is detected as

Since the computational complexity determines the cost of the required hardware resources, such as the number of functional units, it is one of the most important considerations in PIC implementation. For K users, the complexity of direct-form PIC in one chip is 4K(K-1) real multiplications, 2K(K-1) real additions, and 2K subtractions. Furthermore, each user loop contains an "if" statement that maps to a hardware comparator, which makes the loop structure irregular and not conducive to pipelining. Therefore, according to the present invention, the regularity of the computation across all users is exploited by changing the order of "interference estimation" and "interference cancellation".
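The quadratic cost and the irregular per-user branch can be seen in a direct-form sketch for a single chip. The function and argument names are hypothetical, and the weights and reconstructed signals are assumed to be complex, so that each counted product corresponds to four real multiplications.

```python
def direct_form_pic_chip(r_i, weights, omega):
    """Direct-form interference cancellation for one chip sample.

    r_i     -- received chip sample
    weights -- one weight per user
    omega   -- one reconstructed chip signal per user
    Returns the interference-cancelled chip for each user and the number
    of weight-times-signal products that were formed.
    """
    K = len(weights)
    cleaned, products = [], 0
    for k in range(K):
        interference = 0j
        for j in range(K):
            if j != k:                      # the per-user "if" noted in the text
                interference += weights[j] * omega[j]
                products += 1
        cleaned.append(r_i - interference)
    return cleaned, products

cleaned, products = direct_form_pic_chip(
    2 + 0j, [0.5, 0.5, 0.5], [1 + 1j, 1 - 1j, -1 + 1j])
# products equals K * (K - 1) = 6 complex products for K = 3
```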

The architecture according to the invention thus performs the following steps. First, a weighted-sum chip function is calculated by summing the weighted signals of all users to obtain a weighted estimate of the received signal at each chip-rate sample, as

Second, a common residual signal is generated for all users by subtracting the weighted estimate from the original received signal, as

Third, the residual is compensated to each user to obtain the interference-cancelled chip signal,

Finally, a multi-user "chip-matched filter" may be performed on the corrected signal, as in equation (14) above. Thus, the procedure described in the four steps above realizes a chip-level PRC (CL-PRC) architecture.
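The four steps above can be sketched for one chip sample as follows. This is a minimal illustration under assumed names (`weights`, `omega`, and both function names are not from the specification); it reuses each user's weighted signal so that only K products are formed per chip.

```python
def cl_prc_chip(r_i, weights, omega):
    """Chip-level parallel residue compensation for one chip sample."""
    weighted = [w * o for w, o in zip(weights, omega)]   # K products, formed once
    r_hat = sum(weighted)                                # step 1: weighted-sum chip
    e = r_i - r_hat                                      # step 2: common residual
    return [e + wk for wk in weighted]                   # step 3: per-user compensation

def chip_matched_filter(cleaned_chips, code):
    """Step 4: correlate one user's compensated chips with that user's code."""
    return sum(c * x for c, x in zip(code, cleaned_chips))
```

For each user, the value returned by `cl_prc_chip` equals the received chip minus the weighted signals of all other users, which is what a direct-form canceller would compute with K-1 products per user.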

Furthermore, by considering the matched filter and residual compensation steps of equations (15), (16), and (17) together, a symbol-level PRC (SL-PRC) architecture can be derived that uses the stage-0 multi-user matched filter output. At the chip level, "spreading" the weighted symbols of each user and then applying a "matched filter" to them is redundant. Therefore, a matched filter is only necessary for the weighted-sum chip, and is performed as follows

The soft-decision matched filter output of the corrected signal is then generated at the symbol level as

The optimal Weighted Symbol (WS) of equation (13) may be calculated as

And may then be stored in a register or array.
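One way to read the symbol-level structure is sketched below for a single user and a single symbol. It assumes ±1 spreading chips, so that despreading a user's own spread symbol returns the symbol scaled by the sum of that user's per-chip weights; the function and argument names are illustrative, not the specification's.

```python
def sl_prc_symbol(s_mf0, r_hat_chips, code, weights_per_chip, s_prev):
    """Symbol-level PRC output for one user over one symbol interval.

    s_mf0            -- stage-0 matched-filter output for this user
    r_hat_chips      -- weighted-sum chip signal (common to all users)
    code             -- this user's +/-1 spreading chips
    weights_per_chip -- this user's weight at each chip
    s_prev           -- this user's symbol estimate from the previous stage
    """
    # Matched filter applied only to the weighted-sum chip signal.
    ws_mf = sum(c * x for c, x in zip(code, r_hat_chips))
    # Weighted symbol: no spreading or despreading needed because c * c == 1.
    weighted_symbol = s_prev * sum(weights_per_chip)
    return s_mf0 - ws_mf + weighted_symbol
```

The result matches what the chip-level path would give after compensating every chip and then despreading, but the per-user work inside the chip loop shrinks to one correlation against the shared weighted-sum signal.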

A summary of the complexity of the Direct Form (DF) PIC structure, the CL-PRC structure, and the SL-PRC structure is shown in Table 1. It can be seen that the interference cancellation complexity is reduced from the order of O(K²N) in DF-PIC to O(K×N) in the PRC architecture, which is linear in the number of users. Although the SL-PRC architecture is similar to the CL-PRC architecture, its loop chain over the chip index is more compact and regular for scheduling pipelined and parallel architectures, so the SL-PRC architecture tends to yield faster designs than the CL-PRC architecture.

Algorithm	Multiplications	Additions/subtractions
DF-PIC	4K²*N	(2K²-1)*N
CL-PRC	5K*N	(4K-2)*N
SL-PRC	5K*N	(3K-2)*N+K

TABLE 1
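The entries of Table 1 can be evaluated for a concrete load to see how the gap grows with the number of users. The sketch below simply encodes the table; reading the DF-PIC addition count as (2K²-1)·N is an assumption about the flattened entry, and the function name is made up for the example.

```python
def table1_counts(K: int, N: int) -> dict:
    """Operation counts from Table 1 as (multiplications, additions/subtractions)."""
    return {
        "DF-PIC": (4 * K * K * N, (2 * K * K - 1) * N),
        "CL-PRC": (5 * K * N, (4 * K - 2) * N),
        "SL-PRC": (5 * K * N, (3 * K - 2) * N + K),
    }

counts = table1_counts(K=16, N=32)
for name, (mults, adds) in counts.items():
    print(f"{name}: {mults} multiplications, {adds} additions/subtractions")
```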

Turning to fig. 2, a conceptual SoC architecture in accordance with the principles of the present invention is shown that provides a scalable verification solution addressing all aspects of the design cycle and reducing the verification gap. The system-level VLSI design of fig. 2 illustrates one embodiment of an NLMS-based adaptive PRC architecture, divided into several subsystem modules (SBs) according to their respective functions and timing relationships. Each SB represents a Precision-C design module, and the SBs are cascaded in pipeline configuration 202 using, for example, an appropriate Hardware Design Language (HDL) designer. Each SB is made up of several Processing Elements (PEs), which are arranged in a pipeline configuration 204 and/or a parallel configuration 206. The pipelining and parallelism at the PE level reflect the loop structure of the algorithm and offer the best opportunity for optimization. The PEs map to hardware resources in the form of Functional Units (FUs) 210, including registers, memory, multipliers, adders, etc., each exhibiting an additional level of parallel configuration 208.

Turning to fig. 3, an exemplary area constrained architecture for bit-vector joint modulator 306, spreader 308, and multicode combiner 310 in accordance with the present invention is shown. At the transmitter, the input bit streams for the K users are packed into a single-word bit vector buffer 302, such that

In order to conserve storage resources. The spreading codes of the K users may also be combined to form a code vector ROM 312, e.g.
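Packing one bit per user into a single word can be sketched as follows. The bit ordering (user k at bit position k) and the function names are assumptions for the example; the specification does not state the layout here.

```python
def pack_bits(bits):
    """Pack one bit per user into a single integer word (user k -> bit position k)."""
    word = 0
    for k, b in enumerate(bits):
        word |= (b & 1) << k
    return word

def unpack_bit(word: int, k: int) -> int:
    """Read back the bit of user k from the packed word."""
    return (word >> k) & 1

word = pack_bits([1, 0, 1, 1])   # four users share one word
```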

The bits are read out of the vector buffer 302 and converted to parallel I/Q bit streams by a serial-to-parallel converter 304. In the hardware configuration of fig. 3, the bit vector joint modulator 306 and spreader 308 are combined to exploit their common loop structure. The multiplication of spreader 308 is designed using bit-level combinational logic to avoid the use of multipliers. The combinational logic hardware design is given by code segment (23):

Although the logic for the K users can operate in parallel, all K users can also be processed serially, provided the system clock is fast enough to meet the real-time requirements. As can be seen from fig. 3, an efficient VLSI architecture is designed using combinational logic, where modulator 306 and spreader 308 use shift registers, AND gates, and multiplexers controlled by the spreading code bits of the K users. Multicode combiner 310 uses an accumulator architecture to generate the signals SIsum(i) and SQsum(i), so that the real-time requirements of K users can be met with a minimum design area.
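The idea of replacing the spreader's multiplier with a select controlled by the code bit can be sketched in software as follows. The encoding of code bits (0 for chip +1, 1 for chip -1) and the function names are assumptions for illustration; the accumulator outputs are named after SIsum and SQsum in the text.

```python
def spread_without_multiply(value: int, code_bit: int) -> int:
    """Apply a +/-1 chip with a select instead of a multiplier.

    code_bit 0 stands for chip +1, code_bit 1 for chip -1 (assumed encoding).
    """
    return -value if code_bit else value

def multicode_combine(i_levels, q_levels, code_bits):
    """Accumulate the spread I and Q contributions of all users for one chip."""
    si_sum = sq_sum = 0
    for i_val, q_val, bit in zip(i_levels, q_levels, code_bits):   # users in series
        si_sum += spread_without_multiply(i_val, bit)
        sq_sum += spread_without_multiply(q_val, bit)
    return si_sum, sq_sum

si_sum, sq_sum = multicode_combine([1, -1, 1], [1, 1, -1], [0, 1, 1])
```

Processing the users in a serial loop mirrors the area-constrained option described above, where one datapath is reused for all K users when the clock is fast enough.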

Turning to the receiver partitioning block diagram of fig. 4, the loop structure and inherent timing of the algorithm are exploited to achieve pipelining and parallelism, and are further optimized to reduce redundant computations, avoid timing conflicts, and share functional units as well as registers and memory. It can be seen that functional units 402-412 are logically grouped for optimization.

The system-level architecture 400 of the multi-stage NLMS receiver according to the present invention uses a multi-code matched filter as the first stage within functional unit 402. The first-stage matched filter outputs for the K codes are stored in memory module S_MF0[K] 414 for the symbol-level PRC. At the output of the demodulators DEMOD 1 to DEMOD K, the detected bits of the K users are packed into two words, B0 and B1, for QPSK modulation. After parallel-to-serial conversion, the detected bits are received by the reconstructor 404, where signal reconstruction from the detected bits is performed by the modulators MOD 1 to MOD K and the spreading units SP 1 to SP K. The output of reconstructor 404 is passed to the stage 1 NLMS module of functional unit 406 for weight calculation, and is also buffered for stage 1 PRC processing. The interference-cancelled signals are detected for the K users by a combined matched filter and demodulator unit (MFU + DEMU) 408. The multi-stage hardware units of the NLMS-PRC modules 410-412 provide the M stages in pipelined mode, where the detected bits are passed to the following stages of the multi-stage processing and FIFOs may be used to balance the processing latency in each chain.

Fig. 5 shows an exemplary embodiment of the multi-user matched filter module 402 of fig. 4, where the architecture is designed as 2 parallel despreader unit (DSU) + MFU engines 502 and 506. The design is implemented in combinational logic by exploiting the properties of the spreading codes, which eliminates the need for multiplier circuits. The K users are divided into two groups of K/2 users, where each group uses one PE serially, as in the example illustrated in fig. 2. The temporary results of the MFUs are stored in Dual Port Random Access Memory (DPRAM) vectors 504 and 508, respectively, and then accumulated by accumulators 514 and 516, respectively. For each input chip sample, RE[i] and IM[i], the spreading codes C1[i] and C2[i] of the K/2 users in each group are serially shifted from code vector ROMs 510 and 512, respectively, for multiplication with the chip sample. Once the symbols have been accumulated by accumulators 514 and 516, signal symbolready is asserted to indicate that the demodulator unit is required to read the symbol estimates.
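A software analogue of the two-engine matched filter is sketched below. It is an illustration only: the code-bit encoding (0 for chip +1, 1 for chip -1), the function names, and the use of plain lists in place of the DPRAM vectors are all assumptions.

```python
def despread_group(re, im, group_codes):
    """One DSU+MFU engine: serve its users one after another for every chip.

    re, im      -- real and imaginary chip samples for one symbol
    group_codes -- one list of code bits per user (0 -> chip +1, 1 -> chip -1)
    """
    acc = [[0, 0] for _ in group_codes]          # stands in for the DPRAM accumulators
    for i in range(len(re)):                     # chip index
        for u, code in enumerate(group_codes):   # users handled serially
            if code[i]:                          # chip -1: subtract instead of multiply
                acc[u][0] -= re[i]
                acc[u][1] -= im[i]
            else:                                # chip +1: add
                acc[u][0] += re[i]
                acc[u][1] += im[i]
    return [complex(a, b) for a, b in acc]

def multiuser_matched_filter(re, im, codes):
    """Split K users into two groups of K/2, as two engines would process them."""
    half = len(codes) // 2
    return despread_group(re, im, codes[:half]) + despread_group(re, im, codes[half:])

soft = multiuser_matched_filter([1, 2, 3, 4], [0, 1, 0, -1],
                                [[0, 0, 0, 0], [0, 1, 0, 1]])
```

In hardware the two groups would run concurrently; here they are simply evaluated one after the other, which gives the same numerical result.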

As mentioned above, NLMS stages 1 to M represent a significant throughput bottleneck because the algorithm utilizes division and multiplication operations, as exemplified by equation (11), with extensive feedback. The NLMS design module implements the chip-based complex NLMS algorithm described by equations (10) and (11) and computes the optimal weights for all users in each symbol. In mapping the adaptive NLMS algorithm of the present invention to hardware, special attention is paid to the data flow and timing for effective partitioning.

The conventional approach of mapping the LMS algorithm to a parallel and pipelined architecture either introduces a delay in coefficient updating or imposes excessive hardware requirements. However, in accordance with the present invention, to accommodate NLMS, attention is directed to a hardware efficient pipeline architecture that provides substantially the same output and error signals as the standard LMS architecture without the associated delays. Furthermore, the throughput of the architecture according to the invention is independent of the length of the input vector, i.e. the number of users.

Referring back to equations (10) and (11) as described above, two top-level loop structures, L1 and L2, may be derived. The L1 loop is the recursive loop of chip-based updates within each symbol per equation (10), while the L2 loop transfers the weight estimates from the registers to the memory modules once a symbol is ready per equation (11). As exemplified in the block diagram of FIG. 6, loops L1 and L2 map to hardware units.

Loop L1 is composed of two second-level loops shown in blocks 602 and 604. Blocks 602 and 604 loop over the user index, where block 602 computes a weighted estimate of the received signal based on the current weights, and block 604 calculates the iterated weights for the K users. According to the loop structure over code index k and chip index i, the NLMS module can be divided into two main functions: the Weighted Sum Function (WSF) of block 602 as described by equation (10), and the Weighted Adaptation Function (WAF) of block 604 as described by equation (11).

In the WSF block 602, the estimated hard decision bits are extracted from the bit vectors B0 and B1 by the unpacking unit (DPU) of block 614.

The Ω vector of equation (24) is generated from the estimated bits and the spreading code vector C[i] using the same Modulator Spreader Unit (MSU) as in the transmitter; the vector is then stored in a memory module or register file. In the same loop structure, the chip-weight-unit (CWU)/complex-add-unit (CAU) 616 generates a weighted sum of replicas as described in equation (10). Then, as in equation (16), the replica of the received signal is subtracted from the received chip samples to form a residual error. The Ω vector of equation (24) and the residual error of equation (16) are then passed to the WAF module 604.

First, the Ω vector is multiplied by the residual and then by the factor μ/norm. The result is then added to the previous weight and written back to the W_tmp[K] space 610. This process is repeated iteratively for all chips in one symbol. Once each symbol is ready, the Weight Loader (WLP) 606 loads the optimal weights 608 for interference cancellation.

Ping-pong buffer 612 is designed to store the input chip samples for the next symbol while the NLMS module computes the weights. In the NLMS L1 architecture, counter 618 controls the iteration such that, for the first chip of each symbol, the weight vector is set to the initial value 620 of equation (25), which is the scaled channel estimate for each user, where B_W is the bit width of the scaling.

By way of overview, the loop code for the WSF 602 and the WAF 604 is shown in code segments (26) and (27), respectively.
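The two loops can be sketched in software as follows. This is not a reproduction of code segments (26) and (27); it is a minimal floating-point sketch that assumes the standard complex NLMS recursion (weighted sum, residual, then a weight update scaled by μ/norm), with hypothetical names nlms_symbol, omega and w_init chosen for illustration.

```python
def nlms_symbol(chips, omega, w_init, mu):
    """Run the chip-rate NLMS recursion over one symbol.

    chips:  received complex chip samples r(i) for one symbol
    omega:  omega[i][k] is the modulated, spread replica entry of user k at chip i
    w_init: initial complex weights, one per user
    mu:     step size
    """
    w = list(w_init)
    for r, om in zip(chips, omega):
        # WSF: weighted sum of the user replicas, then the residual error
        estimate = sum(wk * ok for wk, ok in zip(w, om))
        error = r - estimate
        # WAF: squared norm of the omega vector, then the weight update
        norm = sum(abs(ok) ** 2 for ok in om)
        w = [wk + (mu / norm) * ok.conjugate() * error for wk, ok in zip(w, om)]
    return w
```

In this sketch the inner sum corresponds to the WSF loop over the user index and the list comprehension to the WAF loop, while the outer loop over chips plays the role of L1.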

In the WSF block 602, the modulation is processed as a vector for all K users. In the WAF module 604, it is necessary to calculate the norm of the Ω vector. The direct calculation of the Ω vector norm is given by equation (28).

Equation (28) has a complexity of 2K multiplications and (K-1) additions. If the Ω vector is stored in a memory array, the complexity increases by 2K memory reads. However, since the hard-decision symbols take values in {±1 ± j} for QPSK and c_k(i) ∈ {±1}, there is no need to compute the norm separately for each symbol. It can be seen that the norm is the constant 2K, so that the division can be implemented by a right shift of log₂(2K) bits. Since the step size μ does not need to be a very precise value, μ and the norm can be combined, after the right shift by log₂(2K), into one coefficient that can be computed offline as a constant.
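The shift-based division can be checked with a small integer sketch. It assumes that each Ω entry has the form ±1 ± j, so that its squared magnitude is 2, and that 2K is a power of two; the helper names are hypothetical.

```python
def omega_norm_sq(omega):
    """Squared norm of an omega vector given as (re, im) integer pairs."""
    return sum(re * re + im * im for re, im in omega)


def divide_by_norm(value, num_users):
    """Divide an integer by 2K using a right shift of log2(2K) bits.

    Assumes 2K is a power of two; otherwise a shift cannot represent the division.
    """
    two_k = 2 * num_users
    if two_k & (two_k - 1) != 0:
        raise ValueError("2K must be a power of two for a pure shift")
    shift = two_k.bit_length() - 1  # log2(2K)
    return value >> shift
```

For K = 8 users the norm is 16 regardless of the data and code bits, so the division in the weight update becomes a fixed right shift by 4 bits.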

The conventional design of the MSU and CWU implementing modules 602 and 604 requires 6 multipliers and a tree layout of the CAU for module 616 for a fully pipelined sum over K users. However, because the hard-decision bits and c_k(i) take the values {±1}, these values can instead be represented by {0, 1}, so that the bits of the K users can be packed into the vector words B₀, B₁ and C[i]. Bit values are extracted from the vector words as: b₀ = (B₀ >> k) & 1; b₁ = (B₁ >> k) & 1; and c_k(i) = (C[i] >> k) & 1. As shown in Table 2 below, the corresponding values may be derived from a truth table based on the different input bits of the spreading code and the hard-decision bits. Further, {0, 1} is used instead of {±1} to represent the hard-decision bits and the spreading-code bits.
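The packing and extraction of per-user bits can be sketched as follows. The extraction expression is the one given in the text; the packing helper and the function names are assumptions added for illustration.

```python
def pack_bits(bits):
    """Pack one bit per user into a single integer word, user k at bit position k."""
    word = 0
    for k, b in enumerate(bits):
        word |= (b & 1) << k
    return word


def unpack_bit(word, k):
    """Extract the bit of user k, as in b0 = (B0 >> k) & 1."""
    return (word >> k) & 1
```

With this layout a single word read delivers the bits of all K users, and each user's bit is recovered with one shift and one mask.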

TABLE 2

The logic design is shown as:

Using the 1-bit values {0, 1} of b₀, b₁ and c_k(i), the multiplications by the 2-bit values {±1} in equations (10) and (11) can be implemented as a decoder-controlled multiplexer (MUX) circuit. The multiplication in equation (10) may then be implemented as a Sum-sub-MUX-Unit (SMU) for the weighted symbol (SMUw).

The same structure can be used for the multiplication in equation (11), as an SMU block for the error (SMUe).

The circuit logic for one SMUw/SMUe 702 is depicted in FIG. 7, where only the sign and the input to accumulator 710 are controlled by the 4-way MUX 708. The difference between the SMU 702 operating as SMUw or SMUe is determined by the input to the MUX 708 and the configuration of the Connection Network (CN) 706. The select decoder 704 generates the SEL[K] signal to replace the original Ω vector, which is then used to control the MUX 708 as shown in Table 3. It should be noted that Table 3 identifies the configuration of the CN 706 for both the SMUw and SMUe configurations of the SMU 702.

TABLE 3
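The idea behind the SMU can be illustrated with a short sketch of a multiplier-free complex product. It assumes a bit mapping of 0 to +1 and 1 to -1 and a select decoder built from XOR gates; the actual MUX and CN assignments of Table 3 are not reproduced here, so the selection tables below are an illustrative assumption, and the function names are hypothetical.

```python
def select_signals(c, b0, b1):
    """Select decoder: XOR the spreading-code bit with each data bit."""
    return c ^ b0, c ^ b1


def smu_multiply(wr, wi, sel0, sel1):
    """Multiply (wr + j*wi) by (a + j*b) with a, b in {+1, -1}, using no multiplier.

    sel0 encodes a and sel1 encodes b (assumed mapping 0 -> +1, 1 -> -1).
    Only one adder, one subtractor, sign inversion and a 4-way selection are needed.
    """
    s = wr + wi  # adder output (sum)
    d = wr - wi  # subtractor output (difference)
    # real part = a*wr - b*wi, imaginary part = b*wr + a*wi
    real = {(0, 0): d, (0, 1): s, (1, 0): -s, (1, 1): -d}[(sel0, sel1)]
    imag = {(0, 0): s, (0, 1): -d, (1, 0): d, (1, 1): -s}[(sel0, sel1)]
    return real, imag
```

Each output is always one of the sum, the inverted sum, the difference or the inverted difference of the two inputs, which is why a 4-way MUX steered by two select bits can replace the general multiplier.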

Referring back to fig. 6, it can be seen that the WSF module 602 and the WAF module 604 for the NLMS algorithm can be integrated using the basic SMU design module of fig. 7, as described above. In one embodiment according to the invention, for example, a parallel arrangement of two SMUw and SMUe engines is shown in fig. 8. In the WSF function of blocks 802 and 804, the K users are split into two blocks of K/2 users, so the select decoders 812 and 816 receive the respective C[i], B[0] and B[1] bit streams and generate the select signals SEL1[K/2] and SEL2[K/2] for SMUw 814 and 818. SMUw 814 and 818 also receive inputs from temporary weight memory blocks 824 and 826.

The CAU 806 sums the two partial paths to obtain the total weighted sum chip signal, which is then subtracted from the original received signal Re[i]/Im[i] to generate an error signal, which is then forwarded to SMUe 820 and 822 of the WAF modules 808 and 810, respectively. Once the total weighted sum chip signal is multiplied by the signal μ_norm, it is adjusted by the weights from the previous iteration and written back to the temporary weight memory blocks 824 and 826. In this way, each engine acts as a single processor for serial processing of K/2 users, which represents a significant improvement in VLSI area and timing closure optimization compared to conventional multiplier designs.

In another embodiment in accordance with the principles of the present invention, the basic SMU design block of fig. 7 may also be used to implement the weight-sum-match-filter (WSMF) and residue-compensation (RC) blocks, as described in equations (15) through (19) above. Similar to the NLMS module of FIG. 8, the symbol-level Sum-sub-MUX-Unit for the weighted symbol (SMUw) in the block diagram of FIG. 9 can be designed with bit-level combinational logic to generate ws[k] as calculated by equation (20). In this example, SMUw 908 is controlled only by the select decoder 914, which is triggered by the B[0] and B[1] vectors. As the weight-match-filter-unit (WMFU) 910 accumulates over the user index k, the MUX in the WMFU 910 is controlled by the spreading code C[i] to accumulate the optimally weighted sum chip signal.

The full datapath logic block diagram of the WSMF and PRC processing described by equations (15) to (19), based on the basic SMUw design block exemplified in fig. 9, is shown in fig. 10. Parallel PEs 1002 and 1004 are built from combinational logic to operate on two groups of K/2 users, where the users in each group utilize their respective PE serially. In each PE 1002 and 1004, the optimal weights 1006 and 1020 are input to the respective SMUw 1008 and 1022 to compute the weighted symbols ws[k] 1010 and ws[k] 1024 and the weighted sum chip signal. The weighted sum chip signal is then detected by WMFUs 1012 and 1026 to form signals 1014 and 1028, which are subtracted from the symbol estimate of the kth user and then added to the weighted symbols 1010 and 1024. The process ends with finding the matched filter outputs 1018 and 1032 of the interference-canceled signals. Once the entire symbol has been accumulated, the signal symbolready is asserted to alert the demodulator unit to read the symbol estimate.
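The residual compensation data flow can be sketched at a behavioral level as follows. Equations (15) to (19) are not reproduced in this section, so the sketch rests on stated assumptions: spreading codes take values ±1, the matched filter is an unnormalized correlation over one symbol, and each user's weighted symbol is added back scaled by the spreading length. All names are hypothetical.

```python
def matched_filter(chips, code):
    """Unnormalized correlation of the chip sequence with a +/-1 spreading code."""
    return sum(c * x for c, x in zip(code, chips))


def prc_symbol(received, codes, weighted_syms):
    """Parallel residual compensation for one symbol (behavioral sketch).

    received:      received complex chips r(i)
    codes:         codes[k][i] in {+1, -1}
    weighted_syms: weighted symbol ws[k] of each user (weight times detected symbol)
    """
    n = len(received)
    # Weighted sum chip signal: replica of the received signal from all users
    replica = [sum(ws * code[i] for ws, code in zip(weighted_syms, codes))
               for i in range(n)]
    # Common residual, shared by all users
    residual = [r - p for r, p in zip(received, replica)]
    # Per user: matched-filter the residual and add the user's own contribution back
    return [matched_filter(residual, code) + n * ws
            for ws, code in zip(weighted_syms, codes)]
```

Because the residual is computed once and shared by all users, the per-user work reduces to one matched filter pass plus one addition, rather than a separate interference reconstruction for every user.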

It should be noted that the architecture of fig. 10 does not require the use of a general multiplier as is conventionally used. Thus, a bit-level combinational logic VLSI architecture can be used to achieve significant improvements in clock rates and to reduce the number of Configurable Logic Blocks (CLBs) required for a design. The improvement in clock rate helps to make more time resources available for processing each user and each chip.

As mentioned above, the VLSI architecture according to the present invention is implemented with a Precision-C approach. In an exemplary design implementation, the specification of a real-time design corresponding to WCDMA and High Speed Downlink Packet Access (HSDPA) systems is analyzed with the Precision-C method. In particular, the chip rate of the downlink wireless multimedia service for these systems is 3.84 MHz with a spreading gain of 16. Given an operating clock rate of 38.4 MHz, a budget of 10 cycles is available for each chip and 160 cycles for each symbol.

The latency of a particular design is determined by the ratio of the number of required cycles to the operating clock rate, i.e.

T_L = N_cycle / f_clk. (33)

Thus, equation (33) indicates that two variables can be used to reduce latency: either reducing the number of required cycles N_cycle, or increasing the operating clock frequency f_clk. For a PE with several different functional units, the critical path determines the highest clock rate that can be reached. Since the latency in the critical path is an accumulation of the latencies of all functional units, retiming is typically required to increase the clock frequency. However, when the design becomes complex, retiming using conventional design methods is quite difficult once the design specifications change.

There is a balance between speed and size when considering the different types of storage hardware that are available. If, for example, a register file is applied to map the data arrays, they can be accessed in parallel in one cycle. In this way, the use of a register file helps to provide increased parallelism. On the other hand, if multiple register files are needed to share multiple functional units, a MUX is needed to control the inputs to the multiple functional units. Since the MUX may primarily affect the design size, increased parallelism will typically result in designs requiring more chip area.

Thus, it is desirable to investigate various mapping and pipelining choices in order to maximize the efficiency of the VLSI implementation under various architectural constraints. Furthermore, according to the invention, through this investigation of the synthesis, a heuristic comparison between multiplier-based architectures and SMU-based architectures is made. For example, optimization of a multiplier-based NLMS architecture results in a design that requires 2697 CLBs, 91 multipliers, 147 cycles, and a 48.4 MHz operating clock frequency. On the other hand, the optimized SMU-based NLMS architecture according to the present invention yields an exemplary design that requires 3477 CLBs, 9 ASIC multipliers, 151 cycles and a 59 MHz operating frequency. Thus, while the SMU-based design remains within the 160-cycle resource constraint, it additionally provides an improvement in operating frequency and reduces the number of required multipliers by a factor of about 10. Similar results may be obtained for other SMU-based architectures discussed herein.
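As a worked check of the figures above, the cycle budget and the latency of equation (33) can be evaluated directly. The sketch below only restates numbers given in the text; the variable and function names are chosen for illustration.

```python
CHIP_RATE_KHZ = 3840   # 3.84 MHz chip rate
CLOCK_KHZ = 38400      # 38.4 MHz operating clock
SPREADING_GAIN = 16    # chips per symbol

cycles_per_chip = CLOCK_KHZ // CHIP_RATE_KHZ
cycles_per_symbol = cycles_per_chip * SPREADING_GAIN


def latency_us(n_cycle, f_clk_mhz):
    """Equation (33): T_L = N_cycle / f_clk, in microseconds for f_clk in MHz."""
    return n_cycle / f_clk_mhz


multiplier_based_us = latency_us(147, 48.4)  # multiplier-based NLMS design
smu_based_us = latency_us(151, 59.0)         # SMU-based NLMS design
```

Under equation (33), the SMU-based design needs four more cycles but completes them at a higher clock frequency, giving roughly 2.56 microseconds against roughly 3.04 microseconds for the multiplier-based design.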

The present invention is directed to an adaptive PRC algorithm for MAI suppression in CDMA systems. The algorithm according to the invention focuses on the use of a set of weights, thereby increasing the confidence level and improving the accuracy of the interference cancellation compared to conventional PIC and PPIC algorithms. Furthermore, the computing architecture of the adaptive PRC is optimized to reduce redundant computations and facilitate efficient VLSI design. The efficiency of VLSI design is achieved primarily due to the use of combinational logic circuits to avoid the use of dedicated ASIC multipliers.

The foregoing description of the exemplary embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. For example, a balance between the speed and size of the architecture of the adaptive PRC algorithm may be made to prioritize one design constraint over another. In this case, the size may have a higher priority than the speed, which allows to reduce the number of CLBs required by a particular architecture, while reducing the maximum frequency of the operating clock. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

Claims

1. An apparatus, comprising:

a matched filter stage coupled to receive the multi-user signal and configured to provide data symbols representing demodulated bit stream packets for each user;

a signal reconstructor coupled to receive the data symbols and configured to generate a modulated representation for each user's data symbols to produce a replica of the multi-user signal;

a Normalized Least Mean Square (NLMS) module coupled to receive the replica of the multi-user signal and configured to compute a weighted estimate of the replica, wherein the NLMS module is configured to:

multiplying a replica of the multi-user signal with a spreading code vector to extract a hard decision bit vector for each chip of the replica, wherein there is at least one chip per data symbol;

multiplying the hard decision bit vector by an accumulated symbol weight to produce a weighted estimate for each chip of the replica;

subtracting the multi-user signal from the weighted estimate for each chip of the replica to produce a residual signal;

adding the accumulated symbol weight to the product of the residual signal and the hard decision bit vector; and

providing final symbol weights once the weights for each chip of the data symbol are accumulated, thereby forming weighted estimates of the replica; and

a parallel residual compensation (PRC) module coupled to receive the weighted estimate of the replica and the multi-user signal and configured to generate a common residual error signal from the weighted estimate of the replica and the multi-user signal, wherein the common residual error signal is subtracted from the data symbols of each user to cancel interference associated with the data symbols of each user.

2. The apparatus of claim 1, wherein the NLMS module comprises a memory module coupled to store a hard decision bit vector for each chip of the replica.

3. A method, comprising:

demodulating a multi-user signal to form a bit stream associated with each user of the multi-user signal;

generating a replica of the multi-user signal from bit streams associated with each user of the multi-user signal;

demodulating the replica of the multi-user signal to obtain modulation symbols, wherein one or more chips are associated with each modulation symbol;

accumulating the weighted values for each chip of the replica of the multi-user signal;

subtracting the weighted value of each chip of the replica from the multi-user signal to generate a residual signal;

multiplying the modulation symbols with the residual signal;

adding the multiplied modulation symbols to the accumulated weight value of each chip to form a weighted estimate of the multi-user signal;

generating a common residual signal by subtracting the weighted estimates of the multi-user signals from the multi-user signals;

compensating the signal of each user by the common residual signal, thereby obtaining an interference-removed signal for each user; and

filtering the interference-canceled signal for each user to obtain an estimate of the transmitted symbol for each user.

4. The method of claim 3, further comprising storing a modulation symbol for each chip of the replica of the multi-user signal.

5. An apparatus, comprising:

a signal reconstruction circuit coupled to receive the multi-user signal and configured to provide data symbols representing the demodulated bit stream packets for each user and configured to generate a modulated representation for the data symbols for each user to produce a replica of the multi-user signal;

a Normalized Least Mean Square (NLMS) circuit coupled to receive the replica of the multiuser signal and configured to accumulate first and second weighted signals generated due to a difference between the multiuser signal and the weighted replica of the multiuser signal, the replica of the multiuser signal including a first spread code bit stream and first and second data streams; wherein the NLMS circuit comprises:

a) a first select decoder coupled to receive the first spread code bit stream and the first and second data streams and configured to generate first and second selection signals in response to bit values of the first spread code bit stream and the respective first and second data streams;

b) a first multiplexer circuit coupled to receive the first and second selection signals and the first and second weighting signals and configured to provide an accumulation of a sum of the first and second weighting signals, wherein signs of the first and second weighting signals are determined by the first and second selection signals; and

c) a second multiplexer circuit coupled to receive the first and second selection signals and the first and second error signals and configured to provide a sum of the first and second error signals, wherein signs of the first and second error signals are determined by the first and second selection signals; and

a parallel residual compensation (PRC) module coupled to receive the weighted replica of the multiuser signal and configured to generate first and second error signals from the weighted replica of the multiuser signal, wherein the first and second error signals are subtracted from the data symbols for each user to cancel interference associated with the data symbols for each user.

6. The apparatus of claim 5, wherein the first select decoder comprises a combinational logic gate to generate the first and second select signals.

7. The apparatus of claim 6, wherein the combinational logic gate comprises:

a first exclusive or gate in which an exclusive or operation is performed on the first spread code bit stream and the first data stream to generate a first selection signal; and

a second exclusive-or gate, wherein performing an exclusive-or operation on the first spread-code bit stream and the second data stream generates a second selection signal.

8. The apparatus of claim 5, wherein the first multiplexer circuit comprises:

an adder coupled to receive the first and second weighted signals and configured to provide a sum of the first weighted signal and the second weighted signal as a first output and an inverse sum of the first weighted signal and the second weighted signal as a second output; and

a subtractor coupled to receive the first and second weighted signals and configured to provide as a first output a difference between the first weighted signal and the second weighted signal and as a second output an inverse difference between the first weighted signal and the second weighted signal.

9. The apparatus of claim 8, wherein the first multiplexer circuit further comprises a first multiplexer coupled to receive the first and second outputs of the adder and the subtractor and configured to select one of the first and second outputs of the adder and the first and second outputs of the subtractor in response to a first select signal.

10. The apparatus of claim 9, wherein the first multiplexer circuit further comprises a second multiplexer coupled to receive the first and second outputs of the adder and the subtractor and configured to select one of the first and second outputs of the adder and the first and second outputs of the subtractor in response to a second select signal.

11. The apparatus of claim 9, wherein the first multiplexer circuit further comprises a connection network coupled to route the first and second outputs of the adder and the subtractor to the first and second multiplexers according to a predetermined routing scheme.

12. The apparatus of claim 5, wherein the second multiplexer circuit comprises:

an adder coupled to receive the first and second error signals and configured to provide a sum of the first error signal and the second error signal as a first output and an inverse sum of the first error signal and the second error signal as a second output; and

a subtractor coupled to receive the first and second error signals and configured to provide as a first output a difference between the first error signal and the second error signal and as a second output an inverted difference between the first error signal and the second error signal.

13. The apparatus of claim 12, wherein the second multiplexer circuit further comprises a first multiplexer coupled to receive the first and second outputs of the adder and the subtractor and configured to select one of the first and second outputs of the adder and the first and second outputs of the subtractor in response to a first select signal.

14. The apparatus of claim 13, wherein the second multiplexer circuit further comprises a second multiplexer coupled to receive the first and second outputs of the adder and the subtractor and adapted to select one of the first and second outputs of the adder and the first and second outputs of the subtractor in response to a second select signal.

15. The apparatus of claim 14, wherein the second multiplexer circuit further comprises a connection network coupled to route the first and second outputs of the adder and the subtractor to the first and second multiplexers according to a predetermined routing scheme.

16. A method, comprising:

establishing two parallel processing paths to operate on two groups of users, wherein each processing path is implemented with combinational logic to operate on each group of users serially, the serial operation including:

estimating a symbol for each user in a set of users;

calculating a weighted symbol for each user in the set of users by generating a weight selection signal using the estimated symbol for each user in the set of users;

calculating a weighted sum chip signal for each user in the group of users by selecting a weighted symbol from a plurality of weighted symbol combinations using the weight selection signal;

generating a detected bit vector from the weighted sum chip signal for each user;

generating a difference between each bit of the detected bit vector and the symbol estimate for each user;

adding the difference to the weighted symbol for each user; and

generating an interference-canceled signal for each symbol once all the bits of the detected bit vector have been processed.

17. The method of claim 16, further comprising:

generating a matched filter output of the interference-canceled signal.