Attorney Docket No. MIT-25446WO01

Machine Intelligence on Wireless Edge Networks

CROSS REFERENCE TO RELATED APPLICATION(S)

[0001] This application claims the priority benefit, under 35 U.S.C. 119(e), of U.S. Application No. 63/669,419, filed on July 10, 2024, which is incorporated by reference herein in its entirety for all purposes.

BACKGROUND

[0002] Deep learning has emerged as a dominant paradigm in modern computing, offering remarkable advancements in tasks such as image classification, natural language processing, and autonomous systems. The success of deep neural networks (DNNs) in these domains is attributable to their ability to learn complex patterns and representations from large volumes of data. However, as DNNs continue to grow in depth and complexity, conventional computing architectures face significant challenges in meeting the computational demands required for training and inference. One of the primary obstacles hindering the scalability and efficiency of traditional computing architectures is the data movement bottleneck. This bottleneck arises from the separation of processing units and memory, leading to inefficiencies in data transfer between these components. As a result, the performance of DNNs, which rely heavily on iterative matrix operations, becomes increasingly limited by the speed of data movement rather than by the computational capabilities of the processing units.

SUMMARY

[0003] Disaggregated memory access DNN architectures have emerged as a promising way to address the memory access bottleneck. Recent demonstrations of DNN architectures with disaggregated memory access show significant reductions in energy consumption and latency compared to digital electronics in fiber-optic connected edge devices. A promising disaggregated optical DNN named the Multiplicative Analog Frequency Transform (MAFT) operates by encoding neuron values in the amplitude and phase of frequency modes, enabling efficient matrix-vector products through photoelectric multiplication. MAFT uses optical fiber links, which limits its deployment in wireless networks. For more information on MAFT, please see U.S. Pre-Grant Publication No. 2023/0281437 A1, which is incorporated herein by reference in its entirety for all purposes.

[0004] The present technology, called Machine Intelligence on Wireless Edge Networks
(MIWEN), is based on a radio-frequency (RF) implementation of MAFT that revolutionizes wireless network architectures. MIWEN capitalizes on disaggregated memory access to enable the wireless streaming of machine learning (ML) models to edge devices, such as smart phones, drones, sensors, and other computing devices at the edges of computer networks, thereby addressing memory and power bottlenecks. By integrating computation into the RF/analog chains of wireless transceivers (TRXs) in edge devices, MIWEN offers orders-of-magnitude reductions in power consumption and latency. By implementing MAFT in analog RF hardware, it becomes possible to perform DNN inference tasks locally on edge devices, reducing the reliance on centralized computing resources and enabling real-time decision-making in wireless networks.

[0005] This DNN inference processing can be performed at an edge device as follows. The edge device's receiver receives an analog weight signal (e.g., a frequency-multiplexed analog weight signal) representing weights of a neural network. A mixer (e.g., a passive diode ring mixer) in the edge device mixes the analog weight signal with an analog input signal to produce an analog product signal that represents the time-domain product of the analog weight signal and the analog input signal. The analog product signal is passed through a nonlinear component, such as a rectifier, which applies a nonlinear activation function (e.g., rectification or self-convolution) to produce an analog output signal representing an output of a layer of the neural network. If desired, a (bandpass) filter can remove noise from the analog product signal before the analog product signal is passed through the nonlinear component. And if desired, an analog-to-digital converter (ADC) can convert the analog output signal into a digital output signal for Fourier-transforming with a digital processor.

[0006] In some cases, the layer of the neural network is a first layer, the analog product signal is a first analog product signal, and the analog output signal is a first analog output signal. These cases may further include using another mixer to mix the first analog output signal with a second analog weight signal to produce a second analog product signal. This second analog product signal is passed through a second nonlinear component to produce a second analog output signal representing an output of a second layer of the neural network. If desired, the edge device may include additional mixers and nonlinear components to execute additional layers of the neural network.

[0007] Mixing the analog weight signal with the analog input signal can include coupling the analog weight signal to a local oscillator (LO) port of the mixer, coupling the analog input signal to a radio-frequency (RF) port of the mixer, and coupling the analog product signal from
an intermediate frequency (IF) port of the mixer.

[0008] The nonlinear component can comprise a rectifier, and applying the nonlinear activation function to the analog product signal can comprise rectifying the analog product signal.

[0009] All combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are part of the inventive subject matter disclosed herein. The terminology used herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.

BRIEF DESCRIPTIONS OF THE DRAWINGS

[0010] The skilled artisan will understand that the drawings primarily are for illustrative purposes and are not intended to limit the scope of the inventive subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the inventive subject matter disclosed herein may be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally and/or structurally similar elements).

[0011] FIG. 1 illustrates Machine Intelligence on a Wireless Edge Network (MIWEN) implemented with a central server that streams the model weights and inference requests and edge devices (clients) with radio-frequency (RF) analog chains in wireless transceivers to run disaggregated deep learning.

[0012] FIG. 2 illustrates a MIWEN server and MIWEN client (edge device) with a digital-tailored machine learning (ML) computation chain.

[0013] FIG. 3 illustrates a MIWEN server and MIWEN client (edge device) with a hardware-tailored ML computation chain.

[0014] FIG. 4 is a plot of the effective number of bits (ENOB) in the output of the RF inner product computation w · x as a function of client and server energy per Multiply-Accumulate (MAC) for a diode ring mixer in a MIWEN analog signal processing engine.

[0015] FIG. 5 is a plot of the MNIST classification accuracy of RF analog deep learning as a function of the total energy per inference for MIWEN.
[0016] FIG. 6 is a plot showing the time-frequency distribution of a signal generated by detecting a frequency-multiplexed optical analog weight signal with a photodetector on an edge device.

DETAILED DESCRIPTION

[0017] Machine Intelligence on Wireless Edge Networks (MIWEN) technology enables the deployment and operation of high-performance machine learning (ML) models on ultra-low-power wireless edge devices. MIWEN technology addresses the power and memory constraints that currently limit the implementation of sophisticated ML models on edge devices. It does this in part by using disaggregated memory access, allowing one or more base stations or central servers to stream ML models wirelessly to edge devices. Streaming the ML model wirelessly bypasses the conventional limitations of local memory and processing power on edge devices by distributing the computational workload across the network. This results in more efficient use of resources and reduces the dependency on high-power, high-memory local devices.

[0018] MIWEN also uses the analog radio-frequency (RF) chain in the wireless transceiver of a modern edge device to perform computations for inference processing. By moving these computations from the digital domain into the RF/analog domain, MIWEN reduces or eliminates the need for traditional digital signal processing, which is typically power-intensive and latency-prone. This integration significantly reduces the overall power consumption and enhances the speed of processing, enabling real-time ML inference on edge devices.

[0019] Performing inference processing in the RF/analog domain also reduces the number of analog-to-digital (A/D) conversions and, where applicable, electro-optical (E/O) conversions. In conventional inference processing, these conversions are typically required to process digital signals at the edge, but they introduce significant power overhead and latency. By operating on the signals in the analog domain instead of the digital domain, MIWEN streamlines the processing pipeline, thereby conserving energy and reducing latency.

[0020] MIWEN also enhances security and privacy at the physical layer. By leveraging low-power RF signals and sophisticated modulation techniques, MIWEN ensures that data transmission is less susceptible to eavesdropping and unauthorized access. The physical-layer security is achieved through properties of the analog signals, which are harder to intercept and decode than digital signals. Additionally, since sensitive data does not need to be transmitted from edge devices to centralized cloud servers, privacy is significantly improved.
[0021] In sum, MIWEN fundamentally changes how ML models are deployed and executed on edge devices. Conventional approaches to deploying and executing ML models rely heavily on digital processing, requiring significant power and computational resources that are often unavailable on edge devices. MIWEN addresses these challenges by:

• streaming ML models and computational tasks wirelessly to edge devices using analog RF modulation;

• using analog signal processing techniques to perform complex computations directly in the analog RF domain;

• reducing power consumption and latency by eliminating unnecessary digital conversions and maintaining signal integrity in the analog domain; and

• enhancing security and privacy by reducing or eliminating data transmission to external servers and leveraging the intrinsic security features of analog signals.

Thanks to these advantages, MIWEN provides a robust and efficient framework for deploying advanced ML models on a wide range of edge devices, enabling new applications and improving the overall performance and security of wireless networks.

[0022] FIG. 1 illustrates energy-efficient inference at edge devices using MIWEN in a wireless network 100. The wireless network 100 includes a central server 102 with a memory that stores weights W for one or more artificial intelligence (AI) models, including but not limited to DNNs and other ML models. The central server 102 is coupled to one or more wireless transmitters, including in this example a 5G wireless transmitter (e.g., cell phone base station) 104, a Wi-Fi® transmitter 106, and a laptop 108 with a Bluetooth® transceiver. In operation, these wireless transmitters modulate the amplitudes of different spectral components of analog RF signals with the weights W and broadcast the resulting analog weight signals to different edge devices 110, including but not limited to smart phones 110-1, smart cameras 110-2, drones 110-3, robots 110-4, and/or (autonomous) cars 110-5. These edge devices 110 are located at the edge of the wireless network 100 and may include sensors, such as cameras, microphones, and antennas, that generate data for processing with the AI models and/or actuators for acting on the outputs of the AI models. Each edge device 110 also includes an analog signal processing engine 120 (right) that processes data locally, reducing latency and bandwidth usage by reducing or eliminating data transmission to a cloud server or other central location.
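By way of illustration only, the following Python listing sketches the weight-encoding step described in the preceding paragraph: each entry of a weight matrix modulates the amplitude of its own spectral component of an RF waveform. The listing assumes the numpy package; the function name, carrier frequency, and comb spacing are illustrative assumptions rather than parameters taken from this disclosure.

import numpy as np

def encode_weights(W, f0, df, fs, T):
    # Illustrative sketch: entry k of the flattened weight matrix W
    # modulates the amplitude of the spectral component at f0 + (k + 1)*df.
    # Returns the time samples and the real-valued analog weight signal.
    t = np.arange(0, T, 1.0 / fs)
    w_flat = W.ravel()
    freqs = f0 + df * (1 + np.arange(w_flat.size))
    tones = np.cos(2 * np.pi * freqs[:, None] * t)   # one tone per weight
    return t, w_flat @ tones                         # amplitude-weighted comb

t, w_t = encode_weights(np.random.default_rng(0).standard_normal((4, 4)),
                        f0=10e6, df=10e3, fs=50e6, T=1e-3)

In an actual transmitter, the equivalent synthesis would be performed by a DAC followed by upconversion to the broadcast band.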
[0023] In greater detail, each wireless transmitter (e.g., the 5G wireless transmitter 104, Wi-Fi® transmitter 106, and/or laptop 108) encodes the weights W in the amplitudes of the frequency components of an RF waveform to yield a modulated analog RF signal called an analog weight signal. Each wireless transmitter transmits its analog weight signal over a wireless channel via one or more antennas to one or more of the edge devices 110, each of which includes an antenna that receives the analog weight signal. If the edge devices are all running the same DNN or ML model, then the wireless transmitters can broadcast the same analog weight signal to every edge device 110. If the edge devices are running different DNNs or ML models, then each wireless transmitter can transmit a different analog weight signal to each edge device 110 (or subset of edge devices 110 running the same DNN or ML model).

[0024] The central server 102 and/or wireless transmitters can also transmit inference requests—inputs to the DNN or ML model—to one or more of the edge devices 110, e.g., via wireless connections or via wired connections, such as coaxial cable connections or fiber-optic connections. The central server 102 can encode each inference request in the amplitudes of spectral components of an analog RF or optical carrier signal. The resulting modulated analog signal is called an analog input signal. If the analog input signal is an optical signal, each edge device 110 includes a suitable photodetector for transducing the optical signal into an analog RF signal that can be processed as described below. Each inference request can be unique such that each edge device 110 performs a different inference processing task, e.g., each edge device 110 may compute the matrix-vector product of a unique inference request/DNN input vector and a common weight matrix. The edge devices 110 generate analog output signals based on the analog weight signal and the analog input signals. If desired, these analog output signals can be digitized for further processing with digital processors.

[0025] Alternatively, or in addition, one or more of the edge devices 110 can include a sensor (e.g., analog sensor 112) for generating an input to its DNN. These sensors can be integrated into or coupled to the edge devices 110. Each sensor can be different; for example, some edge devices 110 may have sensors that record physiological data (e.g., a smart watch with a heart rate sensor that records heart rate, a smart blood pressure monitor that senses blood pressure, etc.), other edge devices 110 may have sensors that sense acoustic or pressure waves (e.g., a smart phone 110-1 with a microphone), and other edge devices 110 may have optical or image sensors (e.g., a smart phone 110-1, smart camera 110-2, or drone 110-3 with an image sensor). Some edge devices (e.g., smart phone 110-1) may have multiple sensors, either of the same type or different types.
[0026] Edge devices 110 that support the same type of sensor may run the same DNN or ML model and operate with the same weights and analog weight signals from the central server 102 and wireless transmitters. Edge devices 110 that support different types of sensors (e.g., microphones versus cameras) may run or execute DNNs or ML models tailored to the data generated by their sensor(s) and use different weights and analog weight signals from the central server 102. Regardless of the type of sensor, an edge device may store and process the data generated by its sensor(s) locally instead of or in addition to sharing the data with the central server 102 or other edge devices. Storing and processing the data locally instead of sharing the data reduces bandwidth consumption and latency. It also reduces the probability that the data could be hacked or intercepted in transmission.

[0027] Each edge device 110 computes the product of the analog weight signal W and its local input signal in the analog RF domain and applies a nonlinear activation function to the resulting analog product to yield a corresponding analog output signal y as described below. The edge device 110 can transmit the analog output signal back to the base station and/or to another device, such as another edge device, for further analog processing. The analog output signal can also be digitized, e.g., with an analog-to-digital converter in the edge device 110, for storage in local and/or remote digital memory and/or for digital processing.

[0028] FIG. 1 also illustrates the analog signal processing engine 120 that performs neural network computations in the analog RF domain using the weights broadcast from the central server 102. The wireless transmitters coupled to the central server 102 transmit the weights as wireless analog RF signals encoded according to the Multiplicative Analog Frequency Transform (MAFT) described below and disclosed in U.S. Pre-Grant Publication No. 2023/0281437 A1, which is incorporated herein by reference in its entirety for all purposes.

[0029] The analog signal processing engine 120 includes one or more concatenated analog RF chains, each of which represents a fully connected (FC) layer of the DNN executed on the edge device 110. The number of analog RF chains can vary from analog signal processing engine 120 to analog signal processing engine 120, depending on the number of layers of the DNN being executed on the edge device 110 by the analog signal processing engine 120.
[0030] Each analog RF chain includes a mixer 122 coupled to a bandpass filter 124 whose output is coupled to a rectifier 125. The edge device 110 receives the weight signals with an antenna 115 and couples them into an input port (the local oscillator (LO) or RF port) of the mixer 122. One input port (e.g., the LO or RF port) of the mixer 122 in the first analog RF chain is coupled to a sensor 112 (e.g., an image sensor, microphone, etc.) or another source that provides an analog input signal 101. If the sensor 112 is a digital sensor, then its output is converted from digital form to analog form by a digital-to-analog converter (DAC) 114. The other input port (e.g., the RF or LO port) of the mixer 122 in each subsequent analog RF chain is coupled to the output port of the rectifier 125 in the preceding analog RF chain. The output port of the rectifier 125 in the last analog RF chain is coupled to an analog-to-digital converter (ADC) 130.

[0031] The edge device 110 receives the analog weight signals from the wireless transmitters with the antenna 115 and disaggregates or demultiplexes them by neural network layer, then feeds the analog weight signals 103-1 through 103-3 into the corresponding input ports (e.g., LO ports) of the respective mixers 122. Each mixer 122 mixes the analog input signal at its other input port (e.g., RF port) with the analog weight signal 103 at its other input port to produce a corresponding analog product signal at its output port (intermediate frequency (IF) port). This analog product signal is the frequency-domain convolution of the analog input signal and the analog weight signal 103 and the time-domain product of the analog input signal and the analog weight signal 103. Each analog product signal is filtered with the corresponding bandpass filter 124 to remove thermal noise introduced by the mixer 122 and rectified (rectification is a suitable nonlinear activation function for a layer of a DNN) with the corresponding rectifier 125 to yield an analog output signal 105 that is fed into the LO port of the mixer 122 in the next analog RF chain. The analog output signal 105-3 from the last analog RF chain is digitized with the ADC 130 to yield a digital output signal 107 suitable for digital signal processing (e.g., a fast or discrete Fourier transformation) with a digital signal processor 132.

[0032] Put differently, in MIWEN, the edge device 110 applies the local data waveform x(t) (e.g., a character from the MNIST handwritten digit classification problem) to the mixer's RF port and injects the broadband multi-tone weight comb w(t) from the wireless transmitter into the mixer's LO port. Because the mixer's output spectrum contains every tone-pair sum ω_RF + ω_LO, each IF line emitted by the mixer 122 directly implements one product term w_kj x_j. Choosing a sub-spacing Δω_sub for intra-column weights and a much larger super-spacing Δω_sup for column leaders guarantees (i) orthogonality of the products and (ii) that the desired multiply-accumulate (MAC) terms fall inside the IF pass-band, whereas mixer intermodulation products land outside and are suppressed by a 2-pole IF bandpass filter 124 plus a diode-based half-wave rectifier 125 (realizing a ReLU-like nonlinear activation). A discrete SiGe double-balanced mixer operates at 6 fJ/MAC at 915 MHz and performs correct accumulation for m, d ≤ 8 without digital correction.
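By way of illustration only, the following Python listing gives a simplified behavioral model of one such analog RF chain: the mixer is modeled as an instantaneous product, the bandpass filter as a 2-pole Butterworth filter, and the rectifier as a half-wave (ReLU-like) nonlinearity. The listing assumes the numpy and scipy packages; the function name, filter order, and band edges are illustrative assumptions.

import numpy as np
from scipy.signal import butter, sosfilt

def analog_rf_chain(x_t, w_t, fs, band):
    # One layer: mixer 122 (product), bandpass filter 124, rectifier 125.
    product = x_t * w_t                        # ideal mixer: time-domain product
    sos = butter(2, band, btype='bandpass', fs=fs, output='sos')
    filtered = sosfilt(sos, product)           # suppress out-of-band mixer products
    return np.maximum(filtered, 0.0)           # half-wave rectifier ~ ReLU

Cascading calls to this function emulates the concatenated analog RF chains of FIG. 1: the output of one chain drives the mixer of the next, and only the final output is digitized.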
[0033] The output of the nonlinear activation block is an analog output signal y(t) that is fed, optionally via another bandpass filter (not shown), as the input to the next neural network layer, which is implemented as another RF mixer. If the nonlinear activation block is in the last layer of the neural network, the output is digitized using an analog-to-digital converter (ADC) 130, Fourier transformed in the digital domain (e.g., using a fast Fourier transform (FFT)) with a digital processor 132, and presented as the output of the neural network/ML model (e.g., a classification of the input/data signal).

Mixers for Matrix-Vector Multiplication with Frequency-Multiplexed Weight Encoding

[0034] A deep learning hardware system should implement matrix-vector multiplication efficiently. The wireless receiver in each edge device described above contains an analog mixer (e.g., mixer 122 in FIG. 1) that performs frequency up- and down-conversion in a standard communications pipeline. The analog mixer receives input waveforms w(t) and x(t) and outputs the instantaneous product waveform y(t) = w(t) ⋅ x(t). Receiver mixers can be used in different ways to calculate matrix-vector products, depending on how the input matrix and vector are encoded into signals.

[0035] In MAFT, the weights {w_kj | k ∈ [m], j ∈ [n]} (where [n] denotes the set {1 … n}) are encoded into the amplitudes of the frequency components of a single waveform w(t), while the vector components {x_j | j ∈ [n]} are encoded into the amplitudes of the frequency components of a different waveform x(t). For example, consider an image that is flattened into the input vector, which is encoded into the different frequency components in the digital domain. The input vector and weights are converted from the digital domain to the analog domain with digital-to-analog converters (DACs) at the edge device and central server, respectively, to produce the analog input signal (analog input vector) and analog weights, which can be expressed as:

w(t) = Σ_{j=1}^{n} Σ_{k=1}^{m} w_kj e^{i(ω_w + jΔω_sup + kΔω_sub)t} + c.c.

x(t) = Σ_{j=1}^{n} x_j e^{i(ω_x + jΔω_sup)t} + c.c.

where ω_w and ω_x are the weight and activation waveform carrier frequencies, Δω_sup is a larger super-frequency spacing, Δω_sub is a smaller sub-frequency spacing, and c.c. is the complex conjugate of the preceding terms, added to ensure that the resulting signal is real-valued. For ease of discussion in later sections, the coefficient of e^{iω_w t} is denoted below by w_+(t), and its complex conjugate by w_−(t). The coefficient of e^{iω_x t} is similarly labeled x_+(t), with its complex conjugate being denoted by x_−(t).

[0036] In this comb-like encoding, the elements of the first column of the weight matrix W are modulated into the amplitudes of consecutive frequency comb lines spaced Δω_sub apart, followed by the elements of the second column of W, and so on. Δω_sup is the spacing between the comb lines that encode the column leaders w_11, w_12, etc. The column leader spacing Δω_sup is larger than Δω_sub so that the elements of the j-th column of W are accommodated between the j-th column leader and the (j + 1)-th column leader comb lines. In fact, Δω_sup > 2mΔω_sub, as discussed below. All elements of the weight matrix W, as opposed to just a single column, are encoded into the waveform in this method. The encoding of the input vector x is more straightforward, with the components being encoded onto the amplitudes of comb lines spaced Δω_sup apart.

[0037] As explained above, the weights are broadcast wirelessly to edge devices, each of which includes an RF mixer that multiplies the waveforms w(t) and x(t) to produce an output waveform y(t):
y(t) = w(t) ⋅ x(t) = w_+(t) x_+(t) e^{i(ω_w + ω_x)t} + w_+(t) x_−(t) e^{i(ω_w − ω_x)t} + c.c.

A bandpass filter filters the output of the mixer for digitization by an ADC.

[0038] From the preceding expression, the output y(t) is itself a frequency comb that is centered at the sum and difference carrier frequencies ω_w + ω_x and ω_w − ω_x. If the frequency spacings Δω_sub and Δω_sup satisfy Δω_sup > 2mΔω_sub, the output comb line at ω = ω_w − ω_x + kΔω_sub will have the amplitude:

y_k = Σ_{j=1}^{n} w_kj x_j^∗

which is the k-th component of W ⋅ x^∗. The condition Δω_sup > 2mΔω_sub is necessary and sufficient to ensure that only the terms in the expression for y(t) that pair a weight comb line from column j with the activation comb line x_j (i.e., terms with equal column indices) contribute to the comb amplitude at ω = ω_w − ω_x + kΔω_sub and that no terms with unequal column indices contribute.
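By way of illustration only, the following Python listing numerically checks this frequency-multiplexed product in an idealized, noiseless setting. It assumes the numpy package; the carriers, spacings, and dimensions are illustrative choices that satisfy Δω_sup > 2mΔω_sub. Because the comb amplitudes here are real, x^∗ = x, and the recovered comb lines match W ⋅ x up to a common scale factor.

import numpy as np

m, n = 4, 4
rng = np.random.default_rng(1)
W, x = rng.standard_normal((m, n)), rng.standard_normal(n)

fs, T = 2e6, 10e-3                   # sample rate and observation window
t = np.arange(0, T, 1 / fs)
f_w, f_x = 300e3, 100e3              # weight and activation carriers
df_sup, df_sub = 10e3, 1e3           # satisfies df_sup > 2*m*df_sub

w_t = sum(W[k, j] * np.cos(2*np.pi*(f_w + (j+1)*df_sup + (k+1)*df_sub)*t)
          for j in range(n) for k in range(m))
x_t = sum(x[j] * np.cos(2*np.pi*(f_x + (j+1)*df_sup)*t) for j in range(n))

spec = np.fft.rfft(w_t * x_t)        # ideal mixer, then spectrum
freqs = np.fft.rfftfreq(len(t), 1 / fs)
lines = np.array([spec[np.argmin(np.abs(freqs - (f_w - f_x + (k+1)*df_sub)))].real
                  for k in range(m)])

ref = W @ x                          # digital reference
assert np.allclose(lines / np.linalg.norm(lines), ref / np.linalg.norm(ref))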
Attorney Docket No. MIT-25446WO01 the comb amplitude at ^^ = ^^^^ − ^^^^ + ^^^^^^^^ and that no terms with ^^ ≠ ^^ contribute. [0039] Since MAFT achieves full matrix-vector multiplication through a single use of the mixer without an integrator, it is well suited for implementation on a standard wireless receiver. With MIWEN, a central server or base station streams out the weights ^^ of a model, in the form of a frequency-multiplexed signal ^^(^^), over a wireless channel. Wireless receivers (e.g., antenna 115) in the edge devices receive these weights and apply them to their local inputs ^^, which are themselves pre-encoded into the frequency-multiplexed signal ^^(^^). This enables the size-weight-and-power-constrained (SWaP-constrained) edge devices to perform energy- and space-efficient machine learning inference. Time-Multiplexed Weight Encoding [0040] Another way of using a mixer to calculate the dot product ^^ ⋅ ^^ of input vectors ^^, ^^ ∈ ℝ^^ is to encode the components of each vector into the amplitudes of two separate time- multiplexed voltage trains of ^^ pulses each, ^^(^^) and ^^(^^), and send them as inputs to the mixer. The output ^^(^^) is then a voltage train of ^^ pulses whose amplitudes are element-wise products of the vectors ^^ and ^^. An op-amp integrator can be used to integrate this pulse train to obtain the final dot product ^^ ⋅ ^^. To generalize this to the matrix-vector product of matrix ^^ ∈ ℝ^^×^^ and vector ^^ ∈ ℝ^^ , one could make ^^ identical copies, denoted by ^^(^^), of the vector ^^ (in other words, fan out ^^ identical copies of the input vector ^^) and compute for each index ^^ the dot product of ^^(^^) with ^^(^^), the ^^-th row of ^^, thence yielding all the components ^^^^ of ^^ = ^^^^. While this approach is simple and appealing, not every wireless receiver contains an integrating element. Therefore, to obtain a practical implementation, the output product pulse trains ^^(^^)(^^) should be converted to the digital domain before summing up the pulse amplitudes to obtain the dot products ^^^^ that make up the output vector ^^. The analog- to-digital conversion may reduce the energy and speed gains from performing the multiplication with an analog mixer. Performance for Machine Learning [0041] The frequency-domain inner-product protocol can be extended to multi-layer neural networks using either (1) a digital-tailored protocol that performs conventional matrix-vector products and utilizes intermediary analog-to-digital and digital-to-analog conversions or (2) a hardware-tailored protocol that more naturally utilizes the physics of the RF pipeline itself. In the digital-tailored approach (described below with respect to FIG. 2), the neural network inputs ^^ and the weights ^^ of the first layer are encoded into the frequency domain. The output
Performance for Machine Learning

[0041] The frequency-domain inner-product protocol can be extended to multi-layer neural networks using either (1) a digital-tailored protocol that performs conventional matrix-vector products and utilizes intermediary analog-to-digital and digital-to-analog conversions or (2) a hardware-tailored protocol that more naturally utilizes the physics of the RF pipeline itself. In the digital-tailored approach (described below with respect to FIG. 2), the neural network inputs x and the weights W of the first layer are encoded into the frequency domain. The output y(t) of the mixer then contains the output pre-activations z of the first layer. The challenge then is to apply an elementwise nonlinearity on each frequency comb line of the output signal. A direct way of doing this is to pass y(t) through an ADC, perform an FFT to extract the amplitudes of the frequency comb lines z_k, apply the nonlinearity on them, perform an inverse FFT to obtain the time-domain post-activation function x(t), and then re-encode x(t) into the analog domain with a DAC for subsequent multiplication with the weights of the next layer in the mixer.

[0042] If desired, the digital FFT operations can be merged into the weight matrices and performed in the analog domain with in-phase/quadrature (IQ) modulators. This 'merging' trick makes it possible to encode weights and activations directly in the time domain, which in turn enables simple element-wise application of the nonlinearities on the time-domain encoded output activations.

[0043] Instead of attempting to faithfully mimic matrix-vector multiplications, the hardware-tailored approach (described below with respect to FIG. 3) moves toward time-domain encoding of the weights and activations. The weights are learned directly by training on the task loss function through the mathematical behavior of the underlying analog hardware, including its nonidealities. In other words, the 'digital twin' of the hardware trains and obtains networks that are optimized for the hardware's physics.

Digital-Tailored Operation of Multi-Layer Neural Networks

[0044] FIG. 2 illustrates a digital-tailored machine learning computation chain suitable for use with a MIWEN server 202 and MIWEN client (edge device) 210 with an analog signal processing engine 220. In this digital-tailored computation chain, the server 202 flattens (204), inverse FFTs (206), and IQ modulates (208) the weights so they can be transmitted in a modified form. The analog signal processing engine 220 processes these modified weights using an IQ modulator 221 and IQ demodulator 226 instead of a digital processor that performs digital FFT and IFFT operations. Spurious frequency content produced by the multiplication process appears as spurious time bins after IQ demodulation and is temporally filtered out. The activations are passed through a time-bin nonlinearity before being looped around to be multiplied by the incoming weights of the next layer from the server.

[0045] In more detail, in this digital-tailored approach, the composition of the ℓ-th and (ℓ + 1)-th layers of the neural network can be represented by:
x^(ℓ+1) = f(W^(ℓ+1) f(W^(ℓ) x^(ℓ−1)))   (1)

These values are encoded in the frequency domain in MIWEN using single-sideband IQ modulators 208 and 221 in the server 202 and client 210, respectively, as discussed below.

[0046] The frequency-encoded weight and activation signals in the MIWEN scheme can be expressed in the general language of single-sideband modulation as follows. Given an input sequence of N complex numbers a_1, a_2, ..., a_N, single-sideband modulation onto different frequency components produces the signal a(t):
a(t) = Σ_{k=1}^{N} a_k e^{i(ω_c + kΔω)t} + c.c.   (2)

where ω_c is the carrier frequency and Δω is the spacing between the frequency components. This can be implemented in practice as follows. Given a complex-valued envelope function ã(t), IQ modulators produce the signal a(t):
a(t) = Re[ã(t) e^{iω_c t}]   (3)

The envelope function ã(t) is assumed to be slowly varying compared to the carrier frequency. This suggests that a way of encoding the MIWEN signals is to first inverse fast Fourier transform the a_k into the envelope ã(t) and feed the result into an IQ modulator to produce a(t). Lumping Fourier transform matrices F into Eq. (1) avoids performing this "encoding" inverse Fourier transform on the client's side, as shown below.

[0047] Next, introduce the N × N orthonormal Fourier transform matrix F_N and its conjugate transpose F_N† into Eq. (1):

x^(ℓ+1) = f(W^(ℓ+1) F_N† F_N f(W^(ℓ) F_N† F_N x^(ℓ−1)))   (4)
        = f(W^(ℓ+1)_pre F_N f(W^(ℓ)_pre x^(ℓ−1)_pre))   (5)

where the pre-encoded activations x^(ℓ−1)_pre ≔ F_N x^(ℓ−1) and the pre-encoded weights W^(ℓ)_pre ≔ W^(ℓ) F_N†. The server broadcasts the pre-encoded weights so that the client can process them with its local pre-encoded activations. Performing an inverse Fourier transform on the pre-encoded activations yields samples of the corresponding time-domain pre-encoded activations x̃_pre(t). But performing the inverse Fourier transform F_N† on x^(ℓ−1)_pre = F_N x^(ℓ−1) simply yields x^(ℓ−1)! Therefore, the samples of x̃_pre(t) that the IQ modulator uses to generate x_pre(t) are simply the direct activations x^(ℓ−1) themselves, with no need for any Fourier transform. This trick eliminates Fourier transforming the activations before encoding them using IQ modulators.
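By way of illustration only, the following Python listing verifies this merging trick numerically (numpy assumed; the orthonormal DFT matrix plays the role of F_N):

import numpy as np

N = 8
F = np.fft.fft(np.eye(N)) / np.sqrt(N)      # orthonormal DFT matrix F_N
rng = np.random.default_rng(3)
W = rng.standard_normal((N, N))             # layer weights W^(l)
x = rng.standard_normal(N)                  # activations x^(l-1)

W_pre = W @ F.conj().T                      # pre-encoded weights W^(l) F_N†
x_pre = F @ x                               # pre-encoded activations F_N x^(l-1)

# Inserting F_N† F_N changes nothing: the pre-encoded product equals W x ...
assert np.allclose(W_pre @ x_pre, W @ x)
# ... and inverse-transforming x_pre returns the raw activations, so the IQ
# modulator can be driven by x^(l-1) directly, with no client-side FFT.
assert np.allclose(F.conj().T @ x_pre, x)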
[0048] The decoding FFTs can be eliminated using a similar trick. The output W^(ℓ)_pre x^(ℓ−1)_pre in Eq. (5) is encoded in the frequency domain of the output waveform y(t). Implementing the nonlinearity in the frequency domain can be more difficult than doing it in the time domain. Thus, the weights sent by the server take on the following modified form:

W^(ℓ)_post-pre ≔ F_N W^(ℓ) F_N†   (6)

This ensures that the output recorded in the frequency domain of the output is

F_N W^(ℓ) x^(ℓ−1)

which implies that the time-domain signal obtained when this output is sampled by an IQ demodulator is W^(ℓ) x^(ℓ−1). A diode (not shown) applies a ReLU nonlinearity on the time-bin output of the IQ demodulator 226 before the output is upconverted using another IQ modulator 221 and passed onto a mixer 222 for subsequent mixing with the incoming weights of the next layer, W^(ℓ+1)_post-pre.

[0049] The components of the vector F_N W^(ℓ) x^(ℓ−1) form a subset of the set of Fourier components of the output y(t) of the mixer. The other components are spurious products that follow from the frequency-domain convolution performed by our protocol. An IQ demodulator downconverts y(t) to yield the vector W^(ℓ) x^(ℓ−1) as time-bin amplitudes. The spurious frequencies are converted to time-bin pulses as well and are temporally filtered away through a switch. The neural network nonlinearity f(·) is applied individually on the remaining time bins before they are routed to the IQ modulator for mixing with the next wave of incoming weights from the server.

Hardware-Tailored Operation of Multi-Layer Neural Networks with Time-Domain Encoding

[0050] FIG. 3 illustrates a hardware-tailored machine learning computation chain suitable for use with a MIWEN server 302 and MIWEN client (edge device) 310 with an analog signal processing engine 320. The server 302 modulates (304) the weights onto an RF carrier and broadcasts the resulting waveform w(t) to the client 310. At the client 310, w(t) gets mixed with the local time-encoded activation waveform x(t) at the client-side analog signal processing engine 320. The result is sent through a layernorm layer (323; implemented via op amps) to stabilize the activations. The result is then passed through a linear filter 324 to enable mixing between time bins and enhance the expressivity of the physical neural network. The resultant final waveform is then mixed with the weights of the next layer.
[0051] In more detail, in the hardware-tailored setting, the objective is not to compute exact inner products in any particular domain (time, frequency, or otherwise), but to enable energy-efficient inference that directly maps to the analog hardware of the analog signal processing engine 320. The analog signal processing engine 320 exploits the physical properties of diode ring mixing and filtering to approximate the desired computation. To this end, the weights and activations are directly encoded in the time domain in a fixed time window, using interpolation on the activations since they have fewer components than the weight matrix. These signals are passed through a diode ring mixer 322, which is time-instantaneous. Since the diode ring mixer 322 does not allow for interaction between different time bins, a filter 324 supplements the mixing process to mix time bins to build a neural network layer. These layers are integrated into the training process, allowing the neural network to learn weights and representations that are optimized for the analog hardware's physics. The combination of analog mixing and filtering operations enables the system to perform expressive transformations of the input signals. Mixing shifts different input components to distinct frequencies, and filtering isolates or weights them selectively; together these operations provide sufficient computational flexibility for learning-based tasks like classification. Co-designing the physical-layer operations and the network training procedure yields high inference accuracy despite the lack of precise digital-style inner product computation. This approach eliminates the need for ADCs or digital logic between layers. The computation is performed directly on modulated analog signals, significantly reducing overhead while preserving task-specific performance.

[0052] For analog signal processing engines with diode ring mixers, training may suffer from vanishing activations and gradients. The modified, fully analog layer-norm layer 323 can mitigate this problem by controlling the activations and gradients. Layer normalization (LayerNorm) is a normalization technique that improves the training stability of deep neural networks by normalizing across the feature dimensions of each individual sample. Given a vector x = [x_1, x_2, ..., x_d], LayerNorm computes the mean and variance as

μ = (1/d) Σ_{i=1}^{d} x_i,   σ² = (1/d) Σ_{i=1}^{d} (x_i − μ)²

and applies normalization followed by an affine transformation:

x̂_i = (x_i − μ) / √(σ² + ε),   y_i = γ_i x̂_i + β_i

where γ_i and β_i are learnable scaling and shifting parameters for each feature, and ε > 0 ensures numerical stability. One variant of LayerNorm retains only the shifting component by fixing γ_i = 1 for all i. The resulting transformation becomes y_i = x̂_i + β_i, preserving unit variance while allowing the mean to adapt through the learnable shift β_i. To enable further mixing between time bins, the same analog bandpass filters are also included after each layernorm module. In summary, each layer of the hardware-tailored network includes a diode ring mixer followed by a layernorm and a filter.
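By way of illustration only, the following Python listing models one hardware-tailored layer as described above: a time-instantaneous diode-ring product, the shift-only layernorm variant, and a linear filter that mixes time bins. It assumes numpy and scipy; the function names and filter taps are illustrative assumptions, and in practice the shift β and the filter response would be learned by training through a digital twin of the hardware.

import numpy as np
from scipy.signal import lfilter

def shift_only_layernorm(z, beta, eps=1e-5):
    # Normalize to zero mean and unit variance, then apply the learnable
    # shift beta (gamma is fixed to 1 in this variant).
    return (z - z.mean()) / np.sqrt(z.var() + eps) + beta

def hardware_layer(x_t, w_t, beta, fir_taps):
    z = x_t * w_t                         # diode ring mixer 322: instantaneous product
    z = shift_only_layernorm(z, beta)     # layernorm 323: stabilize activations
    return lfilter(fir_taps, [1.0], z)    # filter 324: mixing between time bins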
Mutual Information and Effective Number of Bits

[0053] A common metric to measure the discrepancy between digital and analog computation is the average relative absolute error

E_rel = ⟨|(w · x)_analog − (w · x)_digital| / |(w · x)_digital|⟩_{w,x}

where the angular bracket notation ⟨·⟩ denotes averages and the subscript indicates that the averaging is performed over pairs of vectors w, x sampled from some chosen distribution. A shortcoming of this metric is that it could fail to meaningfully connect to the actual information-theoretic transformations caused during the computation. A more appealing and theoretically robust metric is to compute the amount of information about the exact computation that is preserved by the physical machine in the presence of analog noise. This idea is captured exactly by the mutual information I((w · x)_analog; (w · x)_digital) between the analog and digital inner products.

[0054] Another way of evaluating the performance is through the signal-to-noise ratio of the output. Let a general noisy inner product engine be given by:

y = g(w, x; ξ)

where ξ collectively denotes the noise random variables. The "signal strength" for a given w, x is the squared average output ⟨y⟩_ξ², where the average is taken over all the noise components ξ. The "noise strength" Var_ξ(y) for the same w, x is given by the variance of y computed over the noise random variables. The signal and noise strengths for the overall setup are defined as the averages of ⟨y⟩_ξ² and Var_ξ(y) over all choices of w, x:

S = ⟨⟨y⟩_ξ²⟩_{w,x},   N = ⟨Var_ξ(y)⟩_{w,x}

The SNR of the inner product engine is then SNR = S/N. This means that the effective number of bits (ENOB) in the output is:

ENOB = (1/2) log₂(1 + SNR)

This definition of ENOB can be derived from the mutual information.
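By way of illustration only, the following Python listing estimates the ENOB of a noisy inner-product engine by Monte Carlo sampling, following the definitions above (numpy assumed; the additive-Gaussian noise model and all parameter values are illustrative assumptions):

import numpy as np

def enob(engine, dim=256, pairs=200, noise_draws=100, seed=4):
    rng = np.random.default_rng(seed)
    sig, noi = [], []
    for _ in range(pairs):                       # average over pairs w, x
        w, x = rng.standard_normal(dim), rng.standard_normal(dim)
        y = np.array([engine(w, x, rng) for _ in range(noise_draws)])
        sig.append(y.mean() ** 2)                # signal strength <y>_xi^2
        noi.append(y.var())                      # noise strength Var_xi(y)
    snr = np.mean(sig) / np.mean(noi)            # SNR = S / N
    return 0.5 * np.log2(1.0 + snr)

noisy_dot = lambda w, x, rng: w @ x + 0.1 * rng.standard_normal()
print(enob(noisy_dot))                           # ENOB of this toy engine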
[0055] FIG. 4 shows the ENOB in the output of the RF inner product computation w · x plotted as a function of client and server energy per Multiply-Accumulate (MAC) for a diode ring mixer. A fixed bandwidth of 25 MHz is used to encode the components of the input vectors w, x, the vector length is 256, and the ENOB is computed over 200 joint random initializations of the vectors w, x. The energy cost of the analog-to-digital conversion to read out the answer is not included. The inner product accuracy improves with increasing client and server power, reaches a peak, and decreases with further increases in input power due to the deviation of the diode ring transfer function from exact multiplication.

Numerical Results

[0056] FIG. 5 is a plot of the MNIST classification accuracy of RF analog deep learning as a function of the total energy per inference for MIWEN. The legend indicates the network architecture and whether the energy of the layernorm implicit amplification is included. The test accuracy starts out at random guessing, increases to near-digital accuracy, and then deteriorates again as the diode mixer deviates from exact multiplication w(t)x(t). The dotted lines depict the test accuracy as a function of the energy of the client input waveform x(t), while the solid lines depict the same test accuracies but include the amplification energy implicitly expended in the layernorm layers. The results are for five models trained at each energy level; the median and inter-quartile range are presented as the center point and error bars, respectively.

[0057] FIG. 5 presents the variation of the test accuracy obtained by networks of two different sizes when the total client energy is restricted to different thresholds. In the setting of disaggregated memory access, where a server streams the weights w(t) to the client, the main concern is the energy expended by the client in processing the weights (called 'input energy' in FIG. 5). The downsampling of MNIST to feed networks of different sizes proceeds via patch-averaging of the original 28-by-28 images.

[0058] The performance of both networks is equivalent to random guessing below the attojoule (aJ) level and rises to near-digital accuracy when the input energy is about 100 picojoules (pJ). By virtue of the deviation of the diode ring mixer from exact multiplicative mixing at higher energies, the test accuracy experiences a degradation beyond the nanojoule (nJ) level whose severity depends on the network size. The figure presents two kinds of plots for each network size: the dashed lines depict the variation of the test accuracy with total input waveform energy, while the solid lines depict the same accuracy, but as a function of the total client energy, which is the sum of the amplification energy cost inside the layernorm layers and the original input waveform energy.
Optical Weight Distribution

[0059] The central server or base station can also distribute a weight matrix to edge devices via optical carriers modulated with analog RF signals representing the weights in the weight matrix. The central server modulates the analog RF signals onto the optical carriers using optical modulators or other transmitter (TX) schemes, such as gain-modulated lasers, and fans out this optical analog weight signal over a fiber-optic network or free-space optical channels to the edge devices. Each edge device includes or is coupled to a photodetector that detects the analog weight signal. On each edge device, the RF signal is demultiplexed into N channels using techniques from wireless communication receivers, such as modulating the photodetector gain or using amplitude modulation of the RF local oscillator frequencies. At each edge device, the demultiplexed analog weight signal is then multiplied by a local data vector to compute the product of the weight matrix and the local data vector by integrating over the RF channels.

[0060] FIG. 6 is a plot of the photodetector output at an edge device as a function of time and frequency for an edge device performing deep learning inference on a local data vector x. The edge device's photodetector detects an analog weight signal modulated with a weight matrix W that is streamed into the edge device in real time for performing matrix-vector multiplications with ultralow-power multiply-and-accumulate (MAC) performance. The edge device channelizes the weights directly in the RF domain using methods from wireless communications, such as modulating the photodetector gain by the local data vector (analog input signal) x; amplitude modulation of the multiple narrow-band RF local oscillator (LO) frequencies used in the RF demultiplexer; or other RF demultiplexing methods. For example, consider demultiplexing the weights by mixing the analog weight signal locally with N RF channels; in normal operation (e.g., cell phone usage), this mixer is sampled at a bandwidth of [channel bandwidth per symbol]. Here, however, the edge device samples the analog weight signal at a rate of [channel bandwidth per symbol]/[number of accumulation steps]. Then the edge device computes the product y = Wx by integrating over the RF channels by sampling at a rate of [per-channel bandwidth]/[number of accumulation steps].

[0061] By dividing the analog weight signal into N RF channels, the edge device completes on the order of N × N MACs per sampling time. This process works without a local modulator, which makes it much easier to implement. Moreover, the optical analog weight signal can reach the edge device through a single-mode fiber or a free-space optical channel, even a free-space optical channel that changes due to movement. This simplifies the process and makes it much easier to execute on a mobile edge device, such as a drone or smart phone. The hardware for signal processing at the edge devices and the server is also simplified, as low-cost software-defined radio equipment can be used, for example. In summary, the activation values (analog input signal) x can be applied on the edge device by any of a number of methods, including an optical modulator in front of the receiver, modulating the photodetector gain, amplitude modulation of the RF LO frequencies used in the RF demultiplexing, etc.
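By way of illustration only, the following Python listing gives a baseband model of this channelized computation (numpy assumed; the dimensions are illustrative): channel j of the demultiplexed weight stream carries column j of W, one row per accumulation step, and multiplying each channel by the corresponding component of the local data vector and summing over channels yields one component of y = Wx per sampling step.

import numpy as np

rng = np.random.default_rng(5)
m, n = 6, 4                          # n RF channels, m accumulation steps
W, x = rng.standard_normal((m, n)), rng.standard_normal(n)

channels = W.T                       # channel j streams W[:, j] over m steps
y = np.zeros(m)
for step in range(m):                # one output component per sampling time
    y[step] = np.sum(channels[:, step] * x)   # integrate over the n channels
assert np.allclose(y, W @ x)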
[0062] Optically distributed analog weight signals can be distributed wirelessly in hybrid optical/wireless links to extend the coverage area and the range of supported edge device types beyond optical-fiber-connected edge devices to edge devices connected to WiFi®, Bluetooth®, and/or cellular networks. Wirelessly connected edge devices encompass a very broad range of edge devices, including internet-of-things (IoT) devices such as mobile phones, WiFi® sensors, etc. Indeed, MIWEN can use existing components and equipment in present-day hardware, including WiFi®, 5G, and 6G devices. Specifically, with MIWEN it becomes possible to distribute the neural network weights (encoded once at the server side) to thousands of optical-fiber-connected edge devices, each of which can be connected to thousands of wirelessly connected internet-of-things (IoT) devices, so that the total network can serve millions to hundreds of millions of devices, which can then perform advanced machine learning computations and other signal processing at a vanishingly small power consumption compared to the present state of the art.

[0063] An edge device connected to a fiber-optic link can transduce an optical analog weight signal into a wireless analog weight signal with an amplified antenna connected directly to the edge device's photodetector. This avoids power-hungry sampling, analog-to-digital conversion, digital signal processing, and subsequent digital-to-analog conversion, eliminating latency and reducing power consumption. In some fiber-connected edge devices, such as modems, this transduction of analog weight signals from the optical domain to the wireless domain can be realized with little to no additional hardware.

[0064] Free-space-connected edge devices, such as satellites or drones, can also serve as analog weight signal relays. For instance, wirelessly connected devices (through wireless or free-space optical links) can re-transmit weight signals in mesh network configurations. Since the additional power consumption can be extremely small for this workload, existing cellular networks can do this in the background.
Applications and Advantages of MIWEN

[0065] A further advantage of MIWEN is its ability to provide enhanced data security. In traditional ML architectures, sensitive data often needs to be uploaded to the cloud (remote server) for processing, which can pose significant security risks. By allowing edge devices to perform real-time inference on their local data using the weight data streamed in real time from a server, MIWEN allows sensitive data to remain within the edge nodes without being transmitted to a remote server or other remote device. This is particularly useful in applications such as mobile healthcare, where medical data can be processed locally on a mobile phone without the need for it to be transmitted to the cloud. This protects the privacy of the user and reduces processing times, as the data does not need to be transmitted over the network.

[0066] Another advantage of MIWEN is its ability to provide enhanced physical-layer security. Because MIWEN benefits from the higher signal-to-noise ratios (SNRs) of samples of convolutions over many temporal symbols rather than needing a high SNR on each symbol, it can operate with signal amplitudes that are much lower than in typical communications settings. This translates to analog weight signal amplitudes that are too low for eavesdroppers or other adversaries to sample reliably. Lower analog weight signal amplitudes provide an additional layer of security, as they make it much more difficult for adversaries to gain access to sensitive data.

[0067] Overall, MIWEN offers several advantages over traditional approaches to machine learning inference. By allowing edge devices to perform real-time inference on their local data using weight data streamed in real time from the server, it provides enhanced data security and faster processing times. Additionally, its ability to provide enhanced physical-layer security makes it particularly useful in applications where data security is of the utmost importance.

[0068] MIWEN enables the deployment of advanced ML models on low-power devices and enables real-time applications such as self-driving cars. By using RF modulation to stream weight data and demultiplexing on the edge devices, it is possible to perform inference on devices such as cellular phones with ultralow power consumption and high classification accuracy. MIWEN has the potential to revolutionize distributed edge computing by enabling high-speed, low-latency, and low-power matrix-vector products, while also improving data security and privacy.

Conclusion
[0069] While various inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive teachings is/are used. Those skilled in the art will recognize or be able to ascertain, using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.

[0070] Also, various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

[0071] All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.

[0072] The indefinite articles "a" and "an," as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean "at least one."

[0073] The phrase "and/or," as used herein in the specification and in the claims, should be understood to mean "either or both" of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with "and/or" should be construed in the same fashion, i.e., "one or more" of the elements
so conjoined. Other elements may optionally be present other than the elements specifically identified by the "and/or" clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to "A and/or B", when used in conjunction with open-ended language such as "comprising" can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

[0074] As used herein in the specification and in the claims, "or" should be understood to have the same meaning as "and/or" as defined above. For example, when separating items in a list, "or" or "and/or" shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as "only one of" or "exactly one of," or, when used in the claims, "consisting of," will refer to the inclusion of exactly one element of a number or list of elements. In general, the term "or" as used herein shall only be interpreted as indicating exclusive alternatives (i.e., "one or the other but not both") when preceded by terms of exclusivity, such as "either," "one of," "only one of," or "exactly one of." "Consisting essentially of," when used in the claims, shall have its ordinary meaning as used in the field of patent law.

[0075] As used herein in the specification and in the claims, the phrase "at least one," in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase "at least one" refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, "at least one of A and B" (or, equivalently, "at least one of A or B," or, equivalently "at least one of A and/or B") can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0076] In the claims, as well as in the specification above, all transitional phrases such as "comprising," "including," "carrying," "having," "containing," "involving," "holding," "composed of," and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases "consisting of" and "consisting essentially of" shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.