Disclosure of Invention
The invention aims to provide a chip architecture for AI calculation based on NVM (non-volatile memory), overcoming the defects of conventional in-memory computing schemes that use NVM to store neural network weights in the form of analog signals: the transmission and processing of analog signals between layers during neural network calculation is very inconvenient; the structure of the analog calculation array is rigid and does not support a flexible neural network structure; and the various noises and errors in the storage, reading, writing, and calculation of analog signals limit the reliability of the stored neural network model and the calculation accuracy.
In order to achieve the above object, the present invention provides a chip architecture for performing AI calculation based on NVM, which includes an NVM array, an external interface module, an NPU (embedded neural network processor) and an MCU (Microcontroller Unit) connected by bus communication;
the NVM array is used for storing weight parameters of a digitalized neural network, a program operated by the MCU and a neural network model in a chip;
the NPU is used for digital domain accelerated calculation of the neural network;
the external interface module is used for receiving external AI operation instructions, inputting data and outputting AI calculation results outwards;
the MCU is used for executing the program based on the AI operation instruction so as to control the NVM array and the NPU to carry out AI calculation on the input data to obtain the result of the AI calculation.
In this chip architecture, the NPU and the NVM are combined to perform AI neural network calculation: the weight parameters of the neural network are digitally stored in the NVM array inside the chip, the neural network calculation is likewise performed in the digital domain, and the MCU controls the NPU and the NVM array based on an external AI calculation instruction, loading from the NVM array the stored weight parameters of the neural network, the program run by the MCU, and the neural network model to perform the AI calculation. Compared with the various existing storage schemes that use NVM for analog calculation, this digital storage-and-calculation mode has a flexible calculation structure, and the information digitally stored in the NVM has better reliability, precision, and reading accuracy than multi-level analog storage. The scheme therefore not only breaks through the storage-speed bottleneck of off-chip NVM and reduces external input power consumption, but also offers high implementability, flexibility, and reliability.
Further, the chip architecture further includes a high-speed data read channel through which the NPU reads the weight parameters from the NVM array.
In addition to the on-chip bus, the scheme also sets up a high-speed data read channel between the NPU and the NVM array, which supports the bandwidth requirement for high-speed reading of the neural network weight parameters, i.e. the weight data, when the NPU performs digital-domain operation.
Further, the NVM array is provided with N read channels, where N is a positive integer; the read channels read N bits of data in one read cycle, and the NPU is configured to read the weight parameters from the NVM array through the read channels via the high-speed data read channel.
According to the scheme, N read channels are set; preferably, N is 128-512, and N bits of data can be read in one read cycle (usually 30-40 nanoseconds). The NPU reads the weight parameters of the neural network from the NVM array through the read channels via the high-speed data read channel. This bandwidth is far higher than the read speed supportable by off-chip NVM and can meet the parameter read-speed requirement of common neural network inference calculation.
Furthermore, the bit width of the high-speed data read channel is m bits, where m is a positive integer. The chip architecture further comprises a data conversion unit including a cache module and a sequential reading module: the cache module sequentially caches, cycle by cycle, the weight parameters output by the read channels, and its capacity is N*k bits, where k is the number of cycles; the sequential reading module converts the cached data into m-bit-wide words and outputs them to the NPU through the high-speed data read channel, where N*k is an integer multiple of m.
The arrangement further comprises a data conversion unit for the case where the number of read channels does not match the bit width of the high-speed data read channel and/or the two run at different frequencies; it converts the data into a combination of words of the same bit width as the high-speed data read channel, typically words of small width (e.g. 32 bits). The NPU reads data from the data conversion unit via the high-speed data read channel at its own clock frequency (which may exceed 1 GHz).
The data conversion unit provided by the scheme comprises a cache of N*k bits and a sequential reader that outputs m bits at a time, where N*k is an integer multiple of m. The read channels are connected to the NVM array, output N bits per cycle, and k cycles of data can be stored in the cache; the high-speed data read channel is m bits wide. The high-speed data read channel may include read/write Command (CMD) and Acknowledge (ACK) signals connected to the NVM array read control circuitry. After a read operation is completed, the ACK signal notifies the high-speed data read channel (and may simultaneously notify the on-chip bus), and the high-speed data read channel asynchronously inputs the cached data into the NPU over multiple transfers through the sequential reading module.
Further, the chip architecture further includes a Static Random-Access Memory (SRAM), and the SRAM is communicatively connected to the NVM array, the external interface module, the NPU, and the MCU through the bus; the SRAM is used for caching data in the program execution process of the MCU, data in the NPU operation process and input and output data of the neural network model operation.
The chip architecture provided by the scheme comprises an embedded SRAM serving as the cache required for operation and calculation of the on-chip system, used for storing input and output data, intermediate data generated by calculation, and the like. Specifically, it caches data while the MCU executes the program; stores the executable program, system configuration parameters, and calculation-network structure configuration parameters when the MCU runs; caches data during NPU operation; and stores the input and output data of neural network model operation.
Further, a plurality of neural network models are stored in the NVM array, and the AI operation instruction includes an algorithm selection instruction, and the algorithm selection instruction is used for selecting one of the plurality of neural network models as an algorithm for AI calculation.
The neural network models in the scheme are digitally stored in the NVM array, and a plurality of neural network models can be stored according to the number of application scenarios. Where multiple application scenarios correspond to multiple neural network models, the MCU can flexibly select any one of the prestored neural network models for AI calculation according to an externally input algorithm selection instruction, overcoming the problem that existing storage-and-calculation schemes adopting analog calculation have a rigid array structure and do not support a flexible neural network structure.
Further, the NVM array may employ one of a flash memory process, an MRAM (Magnetic Random Access Memory) process, an RRAM (Resistive Random Access Memory) process, an MTP (Multiple Time Programming) process, and an OTP (One Time Programming) process; and/or the interface standard of the external interface module is at least one of SPI (Serial Peripheral Interface), QPI (Quad SPI), and a parallel interface.
Further, the MCU is further configured to receive, through the external interface module, a data access instruction for operating the NVM array from the outside, and the MCU is further configured to complete logic control of basic operations of the NVM array based on the data access instruction.
Further, the NVM array employs one of a SONOS, a floating-gate, and a split-gate flash memory process, and the interface standard of the external interface module is SPI and/or QPI;
the data access instruction is a standard flash memory operation instruction; the AI operation instruction and the data access instruction adopt the same instruction format and rule; the AI operation instruction comprises an operation code, and further comprises an address part and/or a data part, wherein the operation code of the AI operation instruction is different from the operation code of the standard flash memory operation instruction.
The chip architecture provided by the scheme is an improvement on the traditional flash memory chip architecture: specifically, the MCU and the NPU are embedded in the flash memory chip and communicatively connected through an on-chip bus, which may be an Advanced High-performance Bus (AHB) or another communication bus meeting the requirements and is not limited herein. In this scheme, the NPU and the NVM are combined, i.e. calculation and storage are both on-chip: the weight parameters of the neural network are digitally stored in the NVM array, the neural network calculation is likewise digital-domain calculation, and the NPU and the NVM array are controlled by the MCU based on an external AI operation instruction, so that the storage-speed bottleneck of off-chip NVM is broken through, external input power consumption is reduced, and high implementability, flexibility, and reliability are achieved.
The scheme realizes digital operation of the NVM array based on the MCU, which specifically may include basic flash memory operations such as reading, writing, and erasing; the external data access instruction and the external interface may adopt the standard flash memory chip format, so the chip is easy to apply flexibly and simply. The embedded MCU serves as the logic control unit of the NVM and replaces the logic state machine of a standard flash memory, simplifying the chip structure and saving chip area.
The NVM array in this scheme may further be configured to store externally input data not limited to data related to AI calculation; that is, in addition to the neural network model, the weight parameters, and the program run by the on-chip system, it may also store other externally input data related to AI calculation as well as externally input data unrelated to AI calculation, the latter specifically including information such as system parameters, configurations, and/or codes of an external device or system. The basic operations include reading, writing, and erasing the neural network model, the weight parameters, and the program run by the internal system, as well as directly reading, writing, and erasing the stored externally input data in the NVM array.
The instruction used for NVM direct operation and the instruction used for AI calculation processing adopt the same instruction format and rules. Taking the SPI and QPI interfaces as examples, on the basis of the traditional SPI and QPI flash memory operation instruction op_code, op_code values unused by flash memory operations are selected to express AI instructions, more information is transmitted in the address part, and AI data transmission is implemented in the data exchange period. AI calculation can be realized merely by expanding the instruction decoder to multiplex the interface and adding a few status registers and configuration registers.
Further, the chip architecture further includes a Direct Memory Access (DMA) channel, and the DMA channel is used for an external device to directly read and write the SRAM.
The positive progress effects of the invention are as follows:
the invention provides a chip architecture for AI calculation based on NVM, which combines the NPU and the NVM to perform AI neural network calculation: the weight parameters of the neural network are digitally stored in the NVM array inside the chip, the neural network calculation is likewise digital-domain calculation, and the MCU controls the NPU and the NVM array based on an external AI calculation instruction, loading the stored weight parameters of the neural network, the program run by the MCU, and the neural network model to perform AI calculation. Compared with the various existing storage schemes that use NVM for analog calculation, this digital storage-and-calculation mode has a flexible calculation structure, and the information digitally stored in the NVM has good reliability, high precision, and high reading accuracy compared with multi-level analog storage, so that the invention not only breaks through the storage-speed bottleneck of off-chip NVM and reduces external input power consumption, but also has high implementability, flexibility, and reliability.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
The basic architecture and relationship of neural networks and artificial intelligence, non-volatile storage and in-memory computation are explained first.
As previously mentioned, Artificial Intelligence (AI) algorithms were generated by mimicking the structure of the human brain: each neuron connects through synapses to the dendrites of a large number of other neurons, and these neurons of simple function form a neural network that realizes all human intelligence activities. Human memory and intelligence are generally believed to be stored in the different coupling strengths of the synapses.
Neural network algorithms, which emerged in the 1960s, mimic the function of a neuron with a mathematical function. The function accepts a plurality of inputs, each with a different weight, and its output is the sum of each input multiplied by its weight, as shown in the exemplary AI-algorithm neuron diagram of FIG. 1. The process of learning and training is the adjustment of these weights. The function's output feeds many other neurons, forming a network. The algorithm has achieved abundant results and is widely applied. Practical neural networks all have a layered structure: there are no connections among neurons within the same layer, and the input of each neuron is connected to the outputs of several or all neurons in the previous layer. For example, the three-layer neural network shown in FIG. 2 comprises an input layer, a hidden layer, and an output layer, the input layer and hidden layer containing 784 and 15 neurons, respectively. Different connection modes exist between different layers; the network of FIG. 2 is a fully connected network.
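The neuron function described above can be sketched in a few lines. This is a minimal illustration, not part of the invention; the sigmoid activation is an assumption for illustration, since the text only specifies the weighted sum.

```python
import math

def neuron(inputs, weights, bias=0.0):
    # Weighted sum of the inputs, as in the exemplary neuron of FIG. 1:
    # each input is multiplied by its weight and the products are summed.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Nonlinear activation (sigmoid), assumed here for illustration only.
    return 1.0 / (1.0 + math.exp(-z))

# With zero weights the weighted sum is 0 and the sigmoid output is 0.5.
print(neuron([1.0, 2.0], [0.0, 0.0]))  # 0.5
```

Training adjusts the `weights` (and `bias`) until the network's outputs match the desired results.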
More common is the convolutional neural network shown in FIG. 3, in which both the input and the output have a two-dimensional (image) structure, with connections only between nearby points.
However, a practical neural network often has many layers, and its structure may include one or more of convolution layers, image-size-reduction (pooling) layers, and fully connected layers.
Non-volatile storage:
Nonvolatile memory (NVM) is a semiconductor storage medium that retains its contents after power-off. Common NVMs include flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), MRAM, RRAM, FeRAM (Ferroelectric Random Access Memory), MTP, and OTP. The most widely used NVM at present is flash memory, of which the NOR flash structure has higher reliability and faster read speed than the NAND flash structure and is commonly used for storing system codes, parameters, algorithms, and the like. In particular applications, a system may employ an external stand-alone NVM or an embedded NVM integrated within the system. Embedded NVM is generally compatible with CMOS (Complementary Metal-Oxide-Semiconductor) processes, can be integrated with logic computing chips, and provides faster read speed within the system.
Current flash memory has cost and capacity advantages over other NVMs. Many flash memory technologies on the market already have the capability to store multiple bits per cell. Flash memory is slow to erase (milliseconds) but much faster to read (nanoseconds), and the read speed of flash memory and other NVMs can support the high bandwidth required for neural network calculation.
In-memory computing:
Since AI calculation requires extremely high memory bandwidth, architectures that separate the processor from memory/storage encounter a bottleneck of insufficient read speed, and the industry has begun extensive study of in-memory computing architectures that unify storage and calculation. FIG. 4 shows a prior-art scheme that performs AI calculation by adding circuits to a standard NVM array: the nonvolatile memory stores the weights required for neural network calculation, and analog circuits perform vector multiplication, so that large-scale multiply-accumulate operations can proceed in parallel, improving operation speed and saving power consumption.
In the prior art, in-memory calculation with NVM stores the neural network weights in nonvolatile memory and realizes the neural network calculation by analog means. However, because a practical neural network generally has many layers and very complicated connection structures, the transmission and processing of analog signals between layers is very inconvenient and does not support a flexible neural network structure, making the realization and application of a whole neural network model quite difficult; moreover, the various noises and errors in the storage, reading, writing, and calculation of analog signals obviously impair the reliability and calculation accuracy of the model.
FIG. 5 is a diagram of a chip architecture for performing AI calculation based on NVM according to the present invention. As shown in FIG. 5, the chip architecture includes an NVM array 7, an external interface module 2, an SRAM 5, an NPU 6, and an MCU 1 communicatively connected via a bus 4. The MCU 1 reads from and writes to the SRAM 5 and the internal NVM array 7 via the bus 4, and communicates with the NPU 6. The NVM array 7 stores the weight parameters of the digitized neural network, the program run by the MCU 1, and the neural network model on-chip. The NPU 6 performs digital-domain accelerated calculation of the neural network. The external interface module 2 receives external AI operation instructions and input data, and outputs the AI calculation results. The MCU 1 executes the program stored in the NVM array 7 based on the external AI operation instruction, controlling the NVM array 7 and the NPU 6 to perform AI calculation on the input data and obtain the AI calculation result.
The SRAM 5 serves as the cache for system operation and calculation inside the chip, storing input and output data, intermediate data generated by calculation, and the like. Specifically, it caches data while the MCU 1 executes the program; stores the executable program, system configuration parameters, and calculation-network structure configuration parameters when the MCU 1 runs; caches data during NPU 6 operation; and stores the input and output data of neural network model operation.
In the chip architecture provided by this embodiment, the NPU 6 and the NVM are combined to perform AI neural network calculation: the weight parameters of the neural network are digitally stored in the NVM array 7 inside the chip, the neural network calculation is likewise digital-domain calculation, and specifically the MCU 1 controls the NPU 6 and the NVM array 7 based on an external AI calculation instruction, loading the internally stored weight parameters of the neural network, the program run by the MCU 1, and the neural network model to perform the AI calculation. Compared with the various existing storage schemes that use NVM for analog calculation, this digital storage-and-calculation method has a flexible calculation structure, and the digitally stored NVM information has good reliability, high precision, and high reading accuracy. The scheme provided by this embodiment therefore not only breaks through the speed bottleneck of using off-chip NVM and reduces external input power consumption, but also has high implementability, flexibility, and reliability.
In one embodiment, the neural network models are digitally stored in the NVM array 7, and a plurality of neural network models may be stored. The external AI operation instruction comprises an algorithm selection instruction, through which one of the neural network models is selected as the algorithm for AI calculation.
The neural network models in this embodiment are digitally stored in the NVM array 7, and a plurality of them may be stored according to the number of application scenarios. Where multiple application scenarios correspond to multiple neural network models, the MCU 1 can flexibly select any one of the prestored neural network models for AI calculation according to an externally input algorithm selection instruction, thereby overcoming the prior-art problem that the analog calculation array structure integrating storage and calculation is rigid and does not support a flexible neural network structure.
In one embodiment, NVM array 7 employs one of, but not limited to, flash memory, MRAM, RRAM, MTP, OTP. The interface standard of the external interface module 2 is at least one of SPI, QPI, and parallel interface.
In other embodiments, NVM array 7 employs one of, but not limited to, SONOS flash memory, Floating Gate flash memory, and Split Gate flash memory technologies. The interface standard of external interface module 2 is SPI and/or QPI.
The chip architecture provided in this embodiment is an improvement on the traditional flash memory chip architecture: specifically, the MCU 1 and the NPU 6 are embedded in the flash memory chip and communicatively connected through the on-chip bus 4, which may be an AHB bus or another communication bus meeting the requirements and is not limited herein. In the scheme provided by this embodiment, the NPU 6 and the NVM are combined, i.e. calculation and storage are both on-chip: the weight parameters of the neural network are digitally stored in the NVM array 7, the neural network calculation is likewise digital-domain calculation, and specifically the NPU 6 and the NVM array 7 are controlled by the MCU 1 based on an external AI operation instruction, so that the storage-speed bottleneck of off-chip NVM is broken through, external input power consumption is reduced, and high implementability, flexibility, and reliability are achieved.
In one embodiment, in addition to communication over the on-chip bus 4, the chip architecture includes a high-speed data read channel; specifically, a high-speed data read channel is set up between the NPU 6 and the NVM array 7, and the NPU 6 is also used to read the weight parameters from the NVM array 7 via this channel. In this embodiment, the high-speed data read channel supports the bandwidth requirement for high-speed reading of the neural network weight parameters, i.e. the weight data, when the NPU 6 performs digital-domain operation. The bit width of the high-speed data read channel is m bits, where m is a positive integer.
In addition, the NVM array 7 is provided with N read channels, where N is a positive integer, and the read channels read N bits of data in one read cycle; the NPU 6 reads the weight parameters from the NVM array 7 through the read channels via the high-speed data read channel. Preferably, N is 128-512, with one read cycle typically taking 30-40 ns, and the NPU 6 reads the weight parameters of the neural network from the NVM array 7 through the read channels via the m-bit-wide high-speed data read channel. Compared with the read speed supportable by prior-art off-chip NVM, this bandwidth is far higher and can meet the parameter read-speed requirement of common neural network inference calculation.
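The bandwidth claim above can be checked with simple arithmetic. The concrete values of N, the cycle time, and the off-chip comparison figure below are illustrative assumptions within the stated ranges, not values fixed by this embodiment:

```python
# Back-of-envelope check of the on-chip read bandwidth.
N_bits_per_cycle = 256      # read-channel width, assumed within 128-512
cycle_ns = 35               # read cycle, assumed within 30-40 ns

# bits per nanosecond equals gigabits per second
bandwidth_gbit_s = N_bits_per_cycle / cycle_ns

# For comparison, a typical off-chip quad-SPI flash at ~104 MHz moves
# 4 bits per clock (an assumed, typical figure):
offchip_gbit_s = 104e6 * 4 / 1e9

print(round(bandwidth_gbit_s, 2))   # ~7.31 Gbit/s
print(round(offchip_gbit_s, 2))     # ~0.42 Gbit/s
```

Even with these conservative values, the on-chip read path is more than an order of magnitude faster than the off-chip interface, which is the basis of the bandwidth argument above.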
In one embodiment, the chip architecture further comprises a data conversion unit, used where the number of read channels does not match the bit width of the high-speed data read channel and/or the two run at different frequencies; it converts the data into a combination of words of the same bit width as the high-speed data read channel, typically words of small width (e.g. 32 bits). The NPU 6 reads data from the data conversion unit via the high-speed data read channel at its own clock frequency (which may exceed 1 GHz).
Fig. 6 is a schematic diagram of the data conversion unit of the chip architecture of the present application. As shown in Fig. 6, the data conversion unit includes a cache module and a sequential reading module: the cache module buffers, cycle by cycle, the N bits of data output from the NVM array 7 via the read channels, and its capacity is N*k bits, where k is the number of cycles. The sequential reading module converts the cached data into m-bit-wide words and outputs them to the NPU 6 through the high-speed data read channel, where N*k is an integer multiple of m.
The data conversion unit thus comprises a cache module of N*k bits and a sequential reader that outputs m bits at a time, i.e. the sequential reading module, where N*k is an integer multiple of m. The read channels are connected to the NVM array 7, output N bits per cycle, and k cycles of data can be stored in the cache; the high-speed data read channel is m bits wide. The high-speed data read channel may contain read/write Command (CMD) and Acknowledge (ACK) signals connected to the NVM array 7 read control circuitry. After a read operation is completed, the ACK signal notifies the high-speed data read channel (and may also notify the on-chip bus), and the cached data is asynchronously input to the NPU 6 over multiple transfers through the sequential reading module.
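The width conversion performed by the cache and sequential reading modules can be sketched behaviorally. This is a software model only, and the concrete values of N, k, and m are illustrative assumptions satisfying the stated constraint that N*k is an integer multiple of m:

```python
N, k, m = 128, 2, 32          # assumed channel width, cycles, output width
assert (N * k) % m == 0       # constraint stated in the embodiment

def cache_cycles(read_channel_words):
    # Cache module: concatenate k cycles of N-bit reads into one
    # N*k-bit buffer (each word is modeled as a list of N bits).
    bits = []
    for word in read_channel_words:
        bits.extend(word)
    return bits

def sequential_read(bits):
    # Sequential reading module: emit the cached bits as m-bit words
    # for the high-speed data read channel.
    return [bits[i:i + m] for i in range(0, len(bits), m)]

buffer = cache_cycles([[0] * N, [1] * N])   # k = 2 read cycles
words = sequential_read(buffer)
print(len(words))                            # (N*k)/m = 8 words of m bits
```

The NPU 6 would then consume these m-bit words at its own clock rate, independently of the NVM read-cycle timing, which is why the transfer is described as asynchronous.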
In one embodiment, the MCU 1 is further configured to receive, via the external interface module 2, external data access instructions for operating the NVM array 7, and to complete the logic control of the basic operations of the NVM array 7 based on those instructions, which are standard flash memory operation instructions. The AI operation instruction and the data access instruction adopt the same instruction format and rules; the AI operation instruction comprises an operation code and further comprises an address part and/or a data part, the operation code of the AI operation instruction being different from the operation codes of the standard flash memory operation instructions.
The instruction for NVM direct operation and the instruction for AI calculation processing in this embodiment use the same instruction format and rules. Taking the SPI and QPI interfaces as examples, on the basis of the traditional SPI and QPI flash memory operation instruction op_code, op_code values unused by flash memory operations are selected to express AI instructions, more information is transmitted in the address part, and AI data transmission is implemented in the data exchange period. AI calculation can be realized merely by expanding the instruction decoder to multiplex the interface and adding a few status registers and configuration registers.
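The expanded instruction decoding described above can be sketched as follows. The flash op_codes shown (0x03 read, 0x02 page program, 0x20 sector erase) are common standard SPI flash codes; the AI op_code values 0xA0/0xA1 are purely hypothetical stand-ins for "codes unused by flash operations":

```python
# Standard flash operation op_codes handled by the ordinary flash path.
FLASH_OPS = {0x03: "read", 0x02: "page_program", 0x20: "sector_erase"}

# AI instructions reuse the same op_code/address/data format, with
# op_codes chosen from values unused by flash operations (assumed here).
AI_OPS = {0xA0: "select_algorithm", 0xA1: "ai_compute"}

def decode(op_code):
    # Expanded instruction decoder: route by op_code so the one external
    # interface is multiplexed between flash and AI operations.
    if op_code in FLASH_OPS:
        return ("flash", FLASH_OPS[op_code])
    if op_code in AI_OPS:
        return ("ai", AI_OPS[op_code])
    raise ValueError("unknown op_code")

print(decode(0x03))   # ('flash', 'read')
print(decode(0xA1))   # ('ai', 'ai_compute')
```

In hardware this routing is the only decoder change needed; the address part and data exchange period of the AI instruction then carry the extra information and the AI data, exactly as for a flash instruction.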
The MCU 1 realizes digital operation of the NVM array 7, which may specifically include basic flash memory operations such as reading, writing, and erasing; the external data access instruction and the external interface may adopt the standard flash memory chip format, making the chip easy to apply flexibly and simply. The embedded MCU 1 serves as the logic control unit of the NVM and replaces the logic state machine of a standard flash memory, simplifying the chip structure and saving chip area.
The NVM array 7 in this embodiment may be used to store externally input data not limited to data related to AI calculation; that is, in addition to the neural network model, the weight parameters, and the program run by the on-chip system, it may also store other externally input data related to AI calculation as well as externally input data unrelated to AI calculation, the latter specifically including information such as system parameters, configurations, and/or codes of an external device or system. The basic operations include reading, writing, and erasing the neural network model, the weight parameters, and the program run by the internal system, as well as directly reading, writing, and erasing the stored externally input data in the NVM array 7.
In the specific implementation process, the MCU 1 receives external instructions for read and write operations of the NVM array 7 and completes the logic control of the basic NVM operations. These basic operations include storing and reading AI operation model algorithms and parameters, and can also directly store and read system parameters, configurations, codes, etc. in the NVM array 7. The MCU 1 also accepts external AI operation instructions and controls the internal AI operation logic and input/output.
FIG. 7 is a flowchart of the execution of an NVM read/write operation instruction based on the chip architecture of the present application. As shown in FIG. 7, the instruction execution flow is as follows:
Step S101, the external device starts the chip where the NVM is located, and the MCU 1 is powered on.
Step S102, without any external instruction, the MCU 1 loads the codes and parameters required for operation from the NVM array 7 into the SRAM 5, and the chip enters the standby state.
Step S103, the external device sends an NVM operation instruction, and the MCU 1 receives and processes it; the format and processing of the NVM operation instruction are the same as for a conventional standard NVM.
In one embodiment, the chip architecture further includes a DMA channel 3, used by an external device to directly read from and write to the SRAM 5. The external interface module 2 multiplexes data and instructions, and the external device performs direct read-write operations on the on-chip SRAM 5 through the DMA channel 3, improving data transmission efficiency. The external device can also call the SRAM 5 as a system memory resource through the DMA channel 3, increasing the flexibility of chip application.
Fig. 8 is a flowchart for executing an AI operation instruction based on the chip architecture of the present application. As shown in fig. 8, the AI operation instruction execution flow includes:
Step S201, the external device starts the NVM chip, and the MCU 1 is powered on.
Step S202, without any external instruction, the MCU 1 loads the codes and parameters required for operation from the NVM array 7 into the SRAM 5, and the chip enters the standby state.
Step S203, the external device sends an algorithm selection instruction to select one of the neural network models stored in the NVM array 7 of the chip.
Step S204, the MCU 1 processes the instruction, and the corresponding internal storage module is powered on and addressed.
Step S205, the external device sends an AI operation instruction and input data, and the data is buffered in the SRAM 5.
Step S206, the MCU 1 starts the NPU 6 to recognize the input data according to the AI operation instruction.
Step S207, the NPU 6 reads the weight parameter data corresponding to the neural network model from the NVM array 7 and performs the calculation.
Step S208, the external device reads the AI calculation result from the chip through the external interface module 2.
Steps S205 to S208 may be repeated to input, calculate, and output data continuously.
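The flow of steps S201 to S208 can be sketched as a behavioral software model. The class, its methods, and the model weights below are hypothetical placeholders for the on-chip behavior, not the actual hardware interfaces:

```python
class Chip:
    # Behavioral stand-in for the chip of FIG. 5 (illustrative only).
    def __init__(self, models):
        self.models = models      # neural network models in the NVM array 7
        self.sram = {}            # on-chip SRAM 5 cache
        self.model = None

    def power_on(self):                      # S201: external device starts chip
        self.sram["boot"] = "code+params"    # S202: loaded from NVM into SRAM

    def select_algorithm(self, name):        # S203-S204: pick a stored model
        self.model = self.models[name]

    def ai_compute(self, data):              # S205-S207
        self.sram["input"] = data            # input data buffered in SRAM 5
        weights = self.model                 # NPU 6 reads weights from NVM
        return sum(x * w for x, w in zip(data, weights))  # NPU calculation

chip = Chip({"demo": [0.5, -1.0, 2.0]})      # assumed example weights
chip.power_on()
chip.select_algorithm("demo")
result = chip.ai_compute([1.0, 1.0, 1.0])    # S208: result read out
print(result)                                # 1.5
```

Repeating the `ai_compute` call with new input data corresponds to repeating steps S205 to S208 for continuous input, calculation, and output.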
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.