CN116306826A - Hardware acceleration circuit, data processing acceleration method, chip and accelerator - Google Patents
- Publication number
- CN116306826A (application CN202111557307.8A)
- Authority
- CN
- China
- Prior art keywords
- lookup table
- data
- circuit
- index value
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The application relates to a hardware acceleration circuit, a data processing acceleration method, a chip, and an accelerator. The circuit comprises: a storage module for storing a first lookup table and a second lookup table; a lookup table circuit for outputting, in response to the index values of a plurality of data elements, a plurality of exponential function values corresponding to the data elements based on the first lookup table, and for outputting, in response to the index value of an addition result, the reciprocal corresponding to the addition result based on the second lookup table; an adder for outputting to the lookup table circuit the addition result obtained by adding the plurality of exponential function values; and a multiplier for outputting the product of the exponential function value of the i-th data element and the reciprocal of the addition result, thereby obtaining the softmax value of the i-th data element. The scheme provided by the application can increase the data processing speed of the Softmax computation, so that the Softmax function value is obtained more quickly.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a hardware acceleration circuit, a data processing acceleration method, a chip and an accelerator.
Background
Nonlinear functions introduce nonlinear characteristics into artificial neural networks and play a very important role in enabling such networks to learn and understand complex scenes. Nonlinear functions include, but are not limited to, the softmax function, the sigmoid function, and so on.
Taking the Softmax function as an example, it is widely applied in deep learning. In the related art, the function value of the Softmax function may be calculated by a general-purpose computing unit such as a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). However, when the processing of the neural network is performed by a hardware circuit such as a Deep Learning Accelerator (DLA) or a Neural Network Processing Unit (NPU), and the Softmax function layer is located in a middle layer of the neural network, job migration overhead between the DLA/NPU and the CPU/GPU is incurred, so that determining the nonlinear function value with the CPU/GPU is inefficient, increases system bandwidth, and raises power consumption.
Disclosure of Invention
In order to solve, or at least partially solve, the problems in the related art, the application provides a hardware acceleration circuit, a data processing acceleration method, and an accelerator, which can increase the data processing speed in the calculation of a Softmax function and obtain the Softmax function value more quickly.
In one aspect, the present application provides a hardware acceleration circuit, comprising:
the storage module is used for storing the first lookup table and the second lookup table;
a lookup table circuit for outputting, in response to the index values of a plurality of data elements in a data set, a plurality of exponential function values corresponding to the plurality of data elements based on the first lookup table; and outputting, in response to the index value of an addition result, the reciprocal corresponding to the addition result based on the second lookup table;
an adder configured to output the addition result, which is obtained by adding the plurality of exponential function values, to the lookup table circuit;
and a multiplier for outputting the product of the exponential function value of the i-th data element among the plurality of data elements and the reciprocal, to obtain the softmax value of the i-th data element.
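The dataflow claimed above — exponential lookup, accumulation, reciprocal, multiplication — can be sketched in software as a behavioral model. The table granularity, bit widths, and the direct reciprocal computation standing in for the second lookup table below are illustrative assumptions, not the patent's actual design:

```python
import math

# Behavioral sketch only: table size, bit widths, and fixed-point scaling
# are assumed values for illustration.
EXP_BITS = 16                # assumed bit width of an exponential function value
IDX_BITS = 8                 # assumed bit width of a lookup index
SCALE = 1 << EXP_BITS        # fixed-point scale of table entries

# First lookup table: index i -> fixed-point exp(x) for x in [-8, 0]
# (index 0 maps to x = 0, the largest index to x = -8)
exp_lut = [round(SCALE * math.exp(-8.0 * i / ((1 << IDX_BITS) - 1)))
           for i in range(1 << IDX_BITS)]

def softmax_lut(indices):
    exps = [exp_lut[i] for i in indices]   # lookup table circuit, first table
    total = sum(exps)                      # adder
    # The circuit would index a second (reciprocal) table here; this sketch
    # computes the fixed-point reciprocal directly to stay short.
    recip = SCALE * SCALE // total
    return [(e * recip) >> EXP_BITS for e in exps]  # multiplier
```

With two identical inputs the two fixed-point softmax values are equal and sum to `SCALE`, i.e. to 1.0 in fixed point.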
In an embodiment, the exponential function value is data with a bit width of N1 bits, the addition result is data with a bit width of N2 bits, and the index value of the addition result is data with a bit width of N3 bits, where N1 and N3 are both smaller than N2;
the hardware acceleration circuit further includes: and the first conversion circuit is used for converting the addition operation result into an index value of the addition operation result based on the index value conversion parameter.
In one embodiment, the storage module comprises a static storage module;
the index value conversion parameters are stored in the static storage module;
the index value conversion parameter is determined from Gaussian distribution data obtained by statistically analyzing a plurality of addition results computed over a plurality of sample data sets.
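One possible offline procedure for deriving such a conversion parameter is sketched below. The 3-sigma coverage rule and the use of a plain right shift as the conversion parameter are assumptions for illustration, not the patent's stated method:

```python
import statistics

def conversion_shift(sample_sums, n3_bits=8):
    """Profile exp-sums over sample data sets and pick a right shift so that
    the mean + 3*sigma range maps onto an n3_bits-wide index (assumed rule)."""
    mu = statistics.mean(sample_sums)
    sigma = statistics.pstdev(sample_sums)
    upper = mu + 3 * sigma                       # covers ~99.7% of observed sums
    shift = max(0, int(upper).bit_length() - n3_bits)
    return shift                                 # index = sum >> shift

def sum_to_index(s, shift, n3_bits=8):
    return min((1 << n3_bits) - 1, s >> shift)   # clamp into the table range
```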
In an embodiment, the hardware acceleration circuit further comprises:
an index value conversion parameter acquisition circuit for determining and outputting the index value conversion parameter based on the addition operation result;
the first conversion circuit is used for converting the addition operation result into a corresponding index value based on the index value conversion parameter;
the lookup table circuit outputs the reciprocal corresponding to the addition operation result based on the second lookup table, specifically: and outputting the reciprocal corresponding to the addition operation result based on a selected second lookup table, wherein the second lookup table is a second lookup table corresponding to the index value conversion parameter in a plurality of alternative second lookup tables.
In an embodiment, the storage module includes a static storage module, and the plurality of alternative second lookup tables are stored in the static storage module; or,
the memory module includes a dynamic memory module, and the selected second lookup table is stored in the dynamic memory module.
In one embodiment, the storage module includes a first storage area and a second storage area, the first lookup table is stored in the first storage area, and the second lookup table is stored in the second storage area;
the look-up table circuit includes:
the first basic lookup table circuit unit comprises a first logic circuit, a first input terminal group, a first control terminal group and a first output terminal group, wherein the first input terminal group is connected with the first storage area; the first logic circuit is configured to: output a corresponding exponential function value from the first output terminal group in response to the index value of the i-th data element in the data set input from the first control terminal group;
the second basic lookup table circuit unit comprises a second logic circuit, a second input end group, a second control end group and a second output end group, wherein the second input end group is connected with the second storage area; the second logic circuit is configured to: outputting a corresponding reciprocal from the second output terminal group in response to the index value of the addition result input from the second control terminal group;
Wherein: the first basic lookup table circuit unit is N0 input N1 output, the second basic lookup table circuit unit is N3 input N4 output, and the value range of N0, N1, N3 and N4 is [8, 32].
In one embodiment, the storage module includes a first storage area for storing the first lookup table and the second lookup table in a time-sharing manner; or the storage module comprises a first storage area and a second storage area, wherein the first lookup table is stored in the first storage area, and the second lookup table is stored in the second storage area;
the look-up table circuit includes:
the first basic lookup table circuit unit comprises a first logic circuit, a first input terminal group, a first control terminal group and a first output terminal group, wherein the first input terminal group is connected with the storage module, and the first logic circuit is configured to: in a first period, output the exponential function value corresponding to the i-th data element from the first output terminal group based on the first lookup table in response to the index value of the i-th data element input from the first input terminal group; and in a second period after the first period, output the reciprocal corresponding to the addition result from the first output terminal group based on the second lookup table in response to the index value of the addition result input from the first input terminal group;
Wherein:
the first basic lookup table circuit unit is in an N0 input N1 output state; or,
the first basic lookup table circuit unit further comprises a state control terminal group for inputting a first state control signal in the first period and a second state control signal in the second period, so as to configure the basic lookup table circuit unit to be in an N0-input N1-output state in the first period and an N3-input N4-output state in the second period; wherein at least one of the pairs (N0, N3) and (N1, N4) is unequal.
In an embodiment, the hardware acceleration circuit further comprises:
and the second conversion circuit is used for converting the multiplication result from data with the bit width of N4 bits into a flexible maximum value with the bit width of N5 bits based on the index value conversion parameter, wherein N4 is larger than N5.
In an embodiment, the hardware acceleration circuit further comprises:
a subtractor for outputting the result of subtracting the maximum value of a plurality of initial data in an initial data set from each of the plurality of initial data, to obtain the data set containing the plurality of data elements;
and the third conversion unit is used for converting the plurality of data elements into a plurality of index values corresponding to the first lookup table.
In an embodiment, the exponential function value, the addition result, the multiplication result, and the reciprocal of the addition result are fixed-point integers.
Another aspect of the present application provides an artificial intelligence chip comprising a hardware acceleration circuit as described above.
Yet another aspect of the present application provides a data processing acceleration method, the method including:
obtaining, based on a first lookup table, a plurality of exponential function values corresponding to a plurality of data elements in a data set;
obtaining the addition result of the plurality of exponential function values;
obtaining, based on a second lookup table, the reciprocal corresponding to the addition result;
obtaining the product of the exponential function value of the i-th data element among the plurality of data elements and the reciprocal corresponding to the addition result, so as to obtain the softmax value of the i-th data element.
In an embodiment, the obtaining, based on the second lookup table, the reciprocal corresponding to the addition result includes:
converting the addition operation result from data with the bit width of N2 bits into an index value with the bit width of N3 bits based on index value conversion parameters, wherein N3 is smaller than N2;
and obtaining the reciprocal corresponding to the addition result based on the second lookup table and the index value of the addition result.
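The two-step lookup described above can be sketched as follows, under the assumption that the index value conversion parameter is a simple right shift and that each reciprocal table entry is precomputed from the midpoint of its bin (bin boundaries and scale factors are illustrative):

```python
# Assumed bit widths: the N2-bit sum is narrowed to an N3-bit index,
# which then addresses a precomputed reciprocal table.
N2, N3 = 20, 8
SHIFT = N2 - N3              # index value conversion parameter (assumed: a shift)
RECIP_SCALE = 1 << 16        # fixed-point scale of the reciprocal entries

# Entry i approximates RECIP_SCALE / (midpoint of the sums mapped to index i)
recip_lut = [RECIP_SCALE // max(1, (i << SHIFT) | (1 << (SHIFT - 1)))
             for i in range(1 << N3)]

def reciprocal_of_sum(s):
    index = min((1 << N3) - 1, s >> SHIFT)   # first conversion circuit
    return recip_lut[index]                  # second lookup table
```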
In an embodiment, the method further comprises:
writing the index value conversion parameters into a static storage module through a compiler;
the index value conversion parameter is determined from Gaussian distribution data obtained by statistically analyzing a plurality of addition results computed over a plurality of sample data sets.
In an embodiment, the method further comprises:
determining the index value conversion parameter based on the addition operation result;
the obtaining the reciprocal corresponding to the addition result based on the second lookup table and the index value of the addition result includes:
determining a selected second lookup table corresponding to the index value conversion parameter from a plurality of alternative second lookup tables;
and obtaining the reciprocal corresponding to the addition result based on the selected second lookup table and the index value of the addition result.
In an embodiment, the method further comprises:
writing the plurality of alternative second lookup tables into a static storage module through a compiler; and/or
loading the selected second lookup table into a dynamic storage module.
In an embodiment, before the obtaining, based on the first lookup table, of the plurality of exponential function values corresponding to the plurality of data elements in the data set, the method further includes: subtracting from each of a plurality of initial data in an initial data set the maximum value of the plurality of initial data, to obtain the data set comprising the plurality of data elements; and/or
after the obtaining of the product of the exponential function value of the i-th data element and the reciprocal, the method further includes: converting the multiplication result from data with a bit width of N4 bits into data with a bit width of N5 bits based on the index value conversion parameter, where N4 is greater than N5.
In an embodiment, the exponential function value, the addition result, the multiplication result, and the reciprocal of the addition result are fixed-point integers;
the bit widths of the exponential function value and the reciprocal are in the range [8, 32].
In an embodiment, the method is used for realizing a softmax function layer of a neural network, and the neural network is used for classifying data to be processed; wherein
the data to be processed includes at least one of voice data, text data, and image data.
In yet another aspect, the present application provides an artificial intelligence accelerator comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
The technical scheme provided by the application can include the following beneficial effects:
According to the technical scheme, the exponential function value of each data element and the reciprocal corresponding to the addition result of the exponential function values are obtained by table lookup, which avoids complex exponential and reciprocal operations, increases the data processing speed in the calculation of the Softmax function, and yields the Softmax function value more quickly.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a schematic diagram of a neural network according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network for classification according to an embodiment of the present application;
FIG. 3 is a block diagram of the hardware acceleration circuit of an embodiment of the present application;
FIG. 4 is a schematic diagram of the basic look-up table circuit unit according to an embodiment of the present application;
FIG. 5 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application;
FIG. 6 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application;
FIG. 7 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application;
FIG. 8 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application;
FIG. 9 is a flow chart of a data processing acceleration method according to an embodiment of the present application;
FIG. 10 is a flow chart of a data processing acceleration method according to another embodiment of the present application;
FIG. 11 is a flow chart of a data processing acceleration method according to another embodiment of the present application;
FIG. 12 is a block diagram of an artificial intelligence accelerator according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The calculation of a nonlinear function may involve an exponential function and/or a reciprocal; for example, the Softmax function involves computing exponents (exp) and the reciprocal of a sum of exponents (1/sum_of_exp). A dedicated hardware pipeline for the Softmax function is not feasible for high-computing-power implementations, because scaling up the computing power would incur prohibitive hardware cost.
In view of the above problems, the embodiments of the present application provide a data processing acceleration scheme, which obtains, in a table look-up manner, an exponential function value of each data element and an inverse corresponding to an addition result of the exponential function value of each data element, thereby avoiding complex exponential operation and inverse operation, and improving a processing speed of a Softmax function.
Fig. 1 is a schematic structural diagram of a neural network according to an embodiment of the present application.
Referring to fig. 1, a topology of a neural network 100 is shown, including an input layer, a hidden layer, and an output layer. The neural network 100 can receive data elements I1 and I2 at its input layer, perform calculations or operations on them, and generate output data O1 and O2 based on the calculation results.
For example, the neural network 100 may be a deep neural network (Deep Neural Networks, DNN for short) comprising one or more hidden layers. The neural network 100 in fig. 1 includes an input layer L1, two hidden layers L2, L3, and an output layer L4. Among these, DNNs include, but are not limited to, convolutional neural networks (Convolutional Neural Networks, CNN for short), recurrent neural networks (Recurrent Neural Network, RNN for short), and the like.
The four layers shown in fig. 1 are only for facilitating understanding of the technical solution of the present application, and are not to be construed as limiting the present application. For example, the neural network may include more or fewer hidden layers.
Nodes of different layers of the neural network 100 may be connected to each other for data transmission. For example, one node may receive data from other nodes to perform calculations on the received data and output the calculation results to nodes of other layers.
Each node may determine its output data based on the output data received from the nodes of the previous layer and the corresponding weights. For example, in FIG. 1, w(2)11 represents the weight between the first node of the first layer and the first node of the second layer, a(1)1 represents the output data of the first node of the first layer, and b(2)1 represents the bias value of the first node of the second layer; the output data of the first node of the second layer may then be expressed as a(2)1 = f(w(2)11·a(1)1 + w(2)12·a(1)2 + b(2)1), where f is the activation function. The output data of the other nodes is calculated in a similar manner and is not described in detail here.
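A minimal numeric illustration of this node-output computation (weighted sum plus bias, passed through an activation function); the weight and bias values, and the choice of ReLU as the activation function f, are made up for the example:

```python
def node_output(weights, inputs, bias, f=lambda z: max(0.0, z)):
    """Output of one node: f(sum_k w_k * a_k + b). ReLU is used as an
    example activation; any activation function could be substituted."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return f(z)
```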
In some embodiments, an activation function layer, such as a softmax function layer, is configured in the neural network, which can convert the result value for each class into a probability value.
In some embodiments, the neural network is configured with a loss function layer after the softmax function layer; the loss function layer can calculate the loss as an objective function for training or learning.
It can be understood that the neural network can respond to the data to be processed, and the recognition result is obtained after the data to be processed is processed; the data to be processed may include, for example, at least one of voice data, text data, and image data.
One typical type of neural network is a neural network for classification. The neural network for classification may determine the class to which the data element belongs by calculating the probability that the data element corresponds to each class.
Fig. 2 is a schematic structural diagram of a neural network for classification according to an embodiment of the present application.
Referring to fig. 2, the neural network 200 for classification of the present embodiment may include a hidden layer 210, a fully connected layer (FC layer) 220, a softmax function layer 230, and a loss function layer 240.
As shown in fig. 2, the neural network 200 performs computation sequentially through the hidden layer 210 and the FC layer 220 in response to data to be classified, and the FC layer 220 outputs a computation result s corresponding to the classification probabilities of the data element. The FC layer 220 may include a plurality of nodes corresponding respectively to a plurality of classes, each node outputting a result value corresponding to the probability that the data element belongs to the corresponding class. For example, referring back to fig. 1, the FC layer 220 corresponds to the output layer L4 in fig. 1 and has two nodes corresponding to two classes; the output value of one node may be a result value indicating the probability that the data element belongs to the first class, and the output value of the other node a result value indicating the probability that it belongs to the second class. The FC layer 220 outputs the computation result s to the softmax function layer 230, which converts the computation result s into a probability value y; the probability value y may be normalized.
The softmax function layer 230 outputs the probability value y to the loss function layer 240, and the loss function layer 240 may calculate the cross-entropy loss L of the result s based on the probability value y.
During back-propagation learning, the softmax function layer 230 calculates the gradient of the cross-entropy loss L with respect to the result s. Then, the FC layer 220 performs learning processing based on this gradient; for example, the weights of the FC layer 220 may be updated according to a gradient descent algorithm. Further, subsequent learning processing may be performed in the hidden layer 210.
The neural network 200 may be implemented in software, in hardware circuitry, or in a combination of the two. For example, in a hardware circuit implementation, the hidden layer 210, the FC layer 220, the softmax function layer 230, and the loss function layer 240 are all implemented by hardware circuits, and may be integrated in one artificial intelligence chip or distributed among a plurality of chips. This configuration avoids the data migration between the other layers of the neural network and processors such as the CPU/GPU that would occur if the softmax function layer 230 were implemented on a CPU/GPU, so it can improve the data processing efficiency of the neural network, reduce data processing latency and power consumption, and avoid increased bandwidth occupation.
The following describes the technical scheme of the embodiments of the present application in detail with reference to the accompanying drawings.
Fig. 3 is a block diagram of a hardware acceleration circuit according to an embodiment of the present application. In this application, the hardware acceleration circuit may be used, for example but not limited to, to implement the softmax function layer 230 in the neural network 200, and may be, for example but not limited to, a circuit component in a CPLD (Complex Programmable Logic Device) chip, an FPGA (Field Programmable Gate Array) chip, a dedicated chip, or the like.
For ease of understanding, the softmax function is described below. Assuming an array X, the softmax function value of its i-th element x_i can be calculated as shown in formula (1):

σ(x_i) = e^(x_i − x_max) / Σ_j e^(x_j − x_max)    (1)

In formula (1), σ(x_i) represents the softmax function value of the i-th element x_i, e is the natural constant, x_i represents the i-th element of the array X, x_max represents the largest element in the array X, and Σ_j e^(x_j − x_max) represents the addition result of the exponential function values of at least some of the elements in the array X.
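For reference, formula (1) corresponds to the following straightforward (non-accelerated) software implementation. Subtracting x_max keeps every exponent non-positive, which both prevents floating-point overflow and bounds the domain the first lookup table must cover:

```python
import math

def softmax(xs):
    """Numerically stable softmax per formula (1)."""
    x_max = max(xs)
    exps = [math.exp(x - x_max) for x in xs]   # all exponents <= 0
    total = sum(exps)                          # sum of exponential values
    return [e / total for e in exps]           # normalized probabilities
```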
Referring to fig. 3, a hardware acceleration circuit 300 includes a memory module 10, a look-up table circuit 11, an adder 12, and a multiplier 13.
The storage module 10 is used for storing a first lookup table and a second lookup table. The Memory module 10 may be, for example, RAM (Random-Access Memory), ROM (Read-Only Memory), FLASH, or the like.
A Look Up Table (LUT) circuit 11 for outputting, in response to the index values of the respective plurality of data elements in the data set, a plurality of exponential function values corresponding to the plurality of data elements based on the first lookup table, and for outputting, in response to the index value of the addition result, the reciprocal corresponding to the addition result based on the second lookup table.
An adder 12 for outputting the addition result obtained by adding the plurality of exponential function values to the lookup table circuit 11.
And a multiplier 13 for outputting the product of the exponential function value of the i-th data element among the plurality of data elements and the reciprocal corresponding to the addition result, to obtain the softmax value of the i-th data element.
In some embodiments, the look-up table circuit 11 comprises at least one basic look-up table circuit unit 20.
Referring to fig. 4, in one specific implementation, the basic lookup table circuit unit 20 includes a logic circuit 21, an input terminal group 22, a control terminal group 23, and an output terminal group 24. The input terminal group 22 is connected with the memory module 10 and inputs the data of the lookup table into the logic circuit 21; the logic circuit 21 selects, according to the index value (also referred to as an address) input from the control terminal group 23, the value corresponding to that index value in the lookup table, and outputs it from the output terminal group 24. The logic circuit 21 may be, for example, a logic gate circuit or a logic switch circuit. It is understood that in this application, a terminal group refers to a group of connection terminals, covering the case of one or more connection terminals. When the control terminal group 23 has A control terminals and the output terminal group 24 has B output terminals, the basic lookup table circuit unit 20 is said to be A-input B-output.
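A behavioral model of such an A-input B-output unit might look as follows. This is a sketch, not the actual circuit: loading the table from the storage module and addressing it from the control terminals are modeled as method calls:

```python
class BasicLutUnit:
    """Software model of an A-control-input, B-output lookup table unit."""

    def __init__(self, a_bits, b_bits):
        self.a_bits, self.b_bits = a_bits, b_bits
        self.table = [0] * (1 << a_bits)

    def load(self, table):
        # Models the input terminal group: table contents arrive from storage.
        assert len(table) == 1 << self.a_bits
        self.table = list(table)

    def lookup(self, index):
        # Models the control terminal group selecting one table entry,
        # which is driven out on the B-bit output terminal group.
        assert 0 <= index < (1 << self.a_bits)
        return self.table[index] & ((1 << self.b_bits) - 1)
```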
It can be understood that the addition result of the exponential function value may be a result obtained by directly adding the exponential function value, or may be a result obtained by adding the exponential function value after performing a specific transformation. In the case of performing a specific transformation, the visual transformation type performs a corresponding inverse transformation or does not perform an inverse transformation process on the data processing result obtained later. Similarly, processing of other data should be understood broadly to include both of the above and should not be limited to processing of the data itself. Other embodiments are similar and will not be described in detail.
In this embodiment, the inverse corresponding to the exponent function value of each data element and the addition operation result of the exponent function value is obtained by the hardware lookup table circuit in a table lookup manner, so that complex exponent operation and inverse operation are avoided, the data processing speed in the calculation process of the Softmax function can be improved, and the Softmax function value can be obtained more quickly. On the other hand, excessive hardware circuit area and excessive cost for implementing the exponent operation and the reciprocal operation are avoided.
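The datapath described above can be sketched in software (a minimal illustration, not the circuit itself: the input range [-10, 0] and the 256-entry table are assumptions borrowed from the Table 1 example later in this description, and the reciprocal is computed directly here as a stand-in for the second lookup table):

```python
import math

# Sketch of the table-lookup softmax datapath: exponentials come from a
# precomputed first lookup table, the values are summed, and the sum's
# reciprocal is multiplied back in.
N_ENTRIES = 256
STEP = 10.0 / (N_ENTRIES - 1)
EXP_TABLE = [math.exp(-10.0 + i * STEP) for i in range(N_ENTRIES)]  # first lookup table

def to_index(x):
    """Map a data element in [-10, 0] to an index of the first lookup table."""
    return round((x + 10.0) / STEP)

def lut_softmax(xs):
    exps = [EXP_TABLE[to_index(x)] for x in xs]   # lookup table circuit 11
    total = sum(exps)                             # adder 12
    recip = 1.0 / total                           # stand-in for the second lookup table
    return [e * recip for e in exps]              # multiplier 13
```

In the hardware embodiments that follow, the table entries, the sum, and the reciprocal are fixed-point integers of configurable bit widths rather than floating-point values.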
Fig. 5 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application.
Referring to fig. 5, a hardware acceleration circuit 500 includes a memory module 10, a lookup table circuit 11, an adder 12, a multiplier 13, an index value conversion parameter acquisition circuit 14, a first conversion circuit 15, and a second conversion circuit 16.
The storage module 10 is used for storing a first lookup table and a second lookup table.
The lookup table circuit 11 is configured to output, in response to the index values of the respective data elements in the data set, a plurality of exponential function values corresponding to the plurality of data elements based on the first lookup table. The index value of a data element is data with a bit width of N0 bits.
In an embodiment, the index values of the data elements are sequentially input into the lookup table circuit 11, and the lookup table circuit 11 sequentially outputs the exponential function values corresponding to the data elements in the first lookup table. Each exponential function value in the first lookup table is data having a bit width of N1 bits.
An adder 12 for outputting an addition result, which is obtained by adding a plurality of exponential function values, to the lookup table circuit 11.
In one embodiment, adder 12 adds the exponential function values of the data elements to output an addition result having a bit width of N2 bits.
The index value conversion parameter acquisition circuit 14 and the first conversion circuit 15 are configured to acquire an index value corresponding to the addition result output from the adder 12.
The index value conversion parameter acquisition circuit 14 is configured to determine and output an index value conversion parameter based on the addition result.
The first conversion circuit 15 is configured to convert the addition result into a corresponding index value based on the index value conversion parameter. The index value output by the first conversion circuit 15 is data having a bit width of N3 bits.
The lookup table circuit 11 outputs an inverse corresponding to the index value of the addition result based on a selected second lookup table, which is a second lookup table corresponding to the index value conversion parameter among a plurality of alternative second lookup tables, in response to the index value of the addition result. Each reciprocal stored in the second lookup table is data having a bit width of N4 bits, that is, the reciprocal of the addition operation result output by the lookup table circuit 11 is data having a bit width of N4 bits.
The multiplier 13 is configured to output a multiplication result of an exponent function value of an i-th data element among the plurality of data elements and an inverse of the addition result. The multiplication result output by the multiplier 13 is data having a bit width of N5 bits.
The second conversion circuit 16 is configured to convert the multiplication result output from the multiplier 13 into data having a bit width of N6 bits based on the index value conversion parameter, thereby outputting the softmax value of the i-th data element.
In one embodiment, the index value of a data element is a fixed-point integer with a bit width of 8 bits; each exponential function value in the first lookup table is a fixed-point integer with a bit width of 8 bits; the addition result of the plurality of exponential function values is a fixed-point integer with a bit width of 32 bits; the index value of the addition result and the reciprocal are both fixed-point integers with a bit width of 8 bits, i.e. the second lookup table is 8-input, 8-output; the multiplication result is a fixed-point integer with a bit width of 16 bits; and the converted multiplication result is a fixed-point integer with a bit width of 8 bits. That is, N0, N1, N3, N4 and N6 are 8, N2 is 32, and N5 is 16.
It will be appreciated that in other implementations, N0-N6 may take other values. For example, N0, N1, N3 and N4 may each take a value in the range [8, 32], and in some embodiments in the range [8, 12]; N0, N1, N3, N4 and N6 need not be equal, e.g. N0 and N3 may take a value of 9, 10, 11 or 12 while N1 and N4 are 8. Because of the wide dynamic range of Softmax function values, the related art mostly uses software modules to implement the function. The embodiment of the application provides a hardware circuit solution essentially based on 8 bits, which can effectively balance important indicators such as circuit cost, power consumption, bandwidth, performance and data precision.
In this embodiment, in the process of obtaining the reciprocal of the addition result of the exponential function values by table lookup, the index value conversion parameter is determined based on the addition result, the addition result is converted into the corresponding index value based on the index value conversion parameter, the selected second lookup table is determined from the plurality of alternative second lookup tables, and the reciprocal corresponding to the index value of the addition result is then output based on the selected second lookup table. Because the index value conversion parameter is determined in real time from the addition result of each table lookup, the reliability of the obtained lookup result can be ensured.
Furthermore, by implementing the calculation of the softmax function as integer data processing and configuring the input/output data bit widths of the two lookup tables within a small range, the memory resources occupied by the first and second lookup tables and the area of the lookup table circuit can be reduced and the occupied bandwidth decreased; on the other hand, within the range allowed by the precision, the lookup speed and the fixed-point operation speed can be improved, further accelerating the response speed of the circuit and reducing the power consumption.
In one embodiment, the index value conversion parameter includes an index value truncation parameter, and the first conversion circuit 15 truncates the index value of the addition result from the corresponding position in the addition result based on the index value truncation parameter.
In one embodiment, the index value conversion parameter acquisition circuit 14 includes a leading zero count (LZC) circuit. The leading zero count circuit outputs the number of leading 0s in the addition result to the first conversion circuit 15; the number of leading 0s is the number of 0s occurring from the most significant bit of the binary data to the first 1.
In another embodiment, the index value conversion parameter acquisition circuit 14 includes a leading one detection circuit for outputting the position data of the leading 1 in the addition result to the first conversion circuit 15. The leading 1 is the first 1 encountered when scanning from the most significant bit of the binary data. The number of leading 0s or the position data of the leading 1 may be used as the index value truncation parameter.
In one embodiment, the first conversion circuit 15 may include a first shifter. In one specific implementation, the first shifter takes the number of leading 0s as the shift amount, shifts the addition result to the left by that amount, and outputs the shifted data with a bit width of N3 bits, that is, the N3 consecutive bits truncated from the addition result from the leading 1 toward the lower bits, as the index value of the addition result. It will be appreciated that the specific configuration of the first conversion circuit may be chosen in accordance with the specific data structure of the index value.
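This truncation can be modeled behaviorally for a 32-bit addition result and N3 = 8 (illustrative Python mirroring the LZC circuit and the first shifter, not the circuit itself):

```python
def leading_zero_count(value, width=32):
    """Count the 0 bits from the most significant bit down to the first 1."""
    for i in range(width - 1, -1, -1):
        if (value >> i) & 1:
            return width - 1 - i
    return width  # value == 0

def index_of_sum(add_result, n3=8, width=32):
    """Left-shift the addition result by its leading-zero count and keep the
    top n3 bits, i.e. n3 consecutive bits starting at the leading 1."""
    lzc = leading_zero_count(add_result, width)
    shifted = (add_result << lzc) & ((1 << width) - 1)
    return shifted >> (width - n3)
```

For the addition result 0b1_1100_0001 used as an example below, the leading-zero count is 23 and the extracted 8-bit index is 0b11100000.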
The second conversion circuit 16 is configured to convert the multiplication result from data having a bit width of N5 bits to data having a bit width of N6 bits. In one particular implementation, the second conversion circuit 16 includes a second shifter. It will be appreciated that the second conversion circuit 16 may perform saturation, rounding, etc. as needed, so that its data conversion corresponds to the data conversion of the first conversion circuit 15. Rounding includes, for example, round-to-nearest, rounding up, rounding down, rounding toward zero, and the like.
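As an illustration, one possible policy for this conversion is a right shift with round-half-up followed by saturation (the shift amount and rounding mode here are assumptions, since the embodiment leaves them configurable):

```python
def second_convert(mult_result, shift, n6=8):
    """Convert a wide multiplication result to n6 bits: right shift with
    round-half-up, then saturate to the n6-bit unsigned range."""
    if shift > 0:
        mult_result = (mult_result + (1 << (shift - 1))) >> shift  # round half up
    return min(mult_result, (1 << n6) - 1)                         # saturate
```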
In one embodiment, the memory module 10 comprises a static memory module. In one specific implementation, the static memory module is a ROM, and the plurality of alternative second lookup tables are written into the static memory module by the compiler; in another specific implementation, the static memory module is an SRAM, and the plurality of alternative second lookup tables are loaded into the SRAM after power-up; after the index value conversion parameter acquisition circuit 14 outputs the index value conversion parameter, the lookup table circuit 11 outputs the reciprocal corresponding to the index value of the addition result based on the selected second lookup table.
The plurality of alternative second lookup tables respectively correspond to different index value conversion parameters. Taking the index value conversion parameter as the number of leading 0s as an example, for an addition result with a bit width of 32 bits, the smallest possible number of leading 0s is 0 (i.e. the most significant bit of the addition result is 1) and the largest possible number is 31 (i.e. only the least significant bit of the addition result is 1, and all preceding bits are 0); that is, the number of leading 0s may be any integer in [0, 31], giving 32 possibilities in total. Different index value conversion parameters correspond to different value ranges of the addition result, and hence to different value ranges of its reciprocal. Thus, corresponding to the 32 possible index value conversion parameters, the number of alternative second lookup tables is also 32, and the corresponding second lookup table can be selected according to the specific number of leading 0s.
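The contents of the alternative tables are not specified in this embodiment; the sketch below shows one plausible construction, in which the table for leading-zero count z stores 8-bit scaled reciprocals of sums whose leading 1 sits at bit 31 - z. The scaling by 2**(p + n4) and the use of the truncated index as the representative sum are assumptions for illustration, and only the counts for which the 8-bit index window fits inside the sum are built:

```python
def build_reciprocal_table(lzc, n3=8, n4=8, width=32):
    """One alternative second lookup table: 8-bit index -> scaled 1/sum for
    32-bit sums with `lzc` leading zeros (hypothetical scaling)."""
    p = width - 1 - lzc          # bit position of the leading 1 in the sum
    shift = p - (n3 - 1)         # position of the index's least significant bit
    table = {}
    for i in range(1 << (n3 - 1), 1 << n3):   # the index's MSB is always 1
        s = i << shift                         # representative sum for this index
        table[i] = min((1 << n4) - 1, round((1 << (p + n4)) / s))
    return table

# one table per leading-zero count whose index window fits (shift >= 0)
ALT_TABLES = {z: build_reciprocal_table(z) for z in range(25)}
```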
In another embodiment, the memory module 10 comprises a dynamic memory module, which may be, for example, a DRAM, for storing the selected second lookup table that is written in correspondence with the index value conversion parameter. The plurality of alternative second lookup tables may be stored in another memory, for example written into a ROM by the compiler; after the index value conversion parameter acquisition circuit 14 outputs the index value conversion parameter, the selected second lookup table is loaded into the dynamic memory module connected to the lookup table circuit 11, and the lookup table circuit 11 outputs the reciprocal corresponding to the addition result based on the selected second lookup table stored in the dynamic memory module.
Fig. 6 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application.
Referring to fig. 6, a hardware acceleration circuit 600 includes a memory module 10, a lookup table circuit 11, an adder 12, a multiplier 13, a first conversion circuit 15, a subtractor 17, and a third conversion unit 18.
The subtractor 17 is configured to subtract the maximum value of the plurality of initial data from each of the plurality of initial data in the initial data set, respectively, and to output the subtraction results, so as to obtain a data set containing a plurality of data elements.
By this subtraction operation, the value range of the data elements is narrowed, so that the scheme of the application can be implemented with data of smaller bit width and correspondingly smaller hardware circuits. On the other hand, the value of each data element in the data set is negative or 0, so that the exponential function value of the data element with base e is normalized into the range (0, 1].
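The effect of the subtractor can be illustrated with a short sketch (plain Python with illustrative values):

```python
import math

def subtract_max(initial):
    """Subtract the maximum of the initial data from every element, so each
    data element is <= 0 and exp(x) falls in (0, 1]."""
    m = max(initial)
    return [x - m for x in initial]
```

Because softmax is invariant under subtracting a common constant from all inputs, this normalization changes the value range but not the final result.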
A third conversion unit 18 for converting each data element in the data set into an index value of the first lookup table.
To better understand the lookup process of this embodiment, Table 1 below shows one specific example of the first lookup table, which is N0-input, N1-output, where N0 and N1 are both 8. The input data of the first lookup table are index values having a bit width of N0 bits, and the output data are exponential function values having a bit width of N1 bits. For ease of understanding, the data in Table 1 are all represented in decimal format. It will be appreciated that the first lookup table in the memory module 10 stores only the true values of the exponential function values, the lookup table circuit implements the mapping between the index values and those true values, and the data elements and the normalized exponential function values are listed alongside in the table only for better understanding of the present application.
TABLE 1
Since the data elements output by the subtractor 17 are negative or 0, as shown in Table 1, the range of the data elements is defined as [-10, 0]. For table lookup, the value range [-10, 0] is discretized into 256 (i.e. 2^N0) points, each point corresponding to an exponential function value as shown in the column "normalized exponential function value". Each data element point corresponds to an integer value in the range [0, 255] shown in the column "index value", and each normalized exponential function value corresponds to an integer value in the range [0, 255] shown in the column "exponential function value". The data in the column "exponential function value" are stored as true values in the memory module 10, and the table lookup can be performed through the index value.
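A hypothetical construction of such a table follows the description above; the discretization of [-10, 0] into 256 points and the quantization of the exponential values to [0, 255] are as stated, while the exact rounding is an assumption:

```python
import math

N0, N1 = 8, 8
POINTS = 1 << N0                      # 256 index values over [-10, 0]
STEP = 10.0 / (POINTS - 1)

# index -> 8-bit quantized exponential function value (first lookup table)
first_lut = [round(math.exp(-10.0 + i * STEP) * ((1 << N1) - 1))
             for i in range(POINTS)]
```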
The storage module 10 may include a static storage module, where the first lookup table and the second lookup table are stored in different storage units of the static storage module, and the static storage module further stores index value conversion parameters; the first lookup table, the second lookup table, and the index value conversion parameter may be written to the static storage module, for example, by a compiler.
In one embodiment, the index value conversion parameter may be determined offline and then written into the static storage module by the compiler. The first conversion circuit 15 can directly obtain the index value of the addition result according to the index value conversion parameter written into the static storage module. The index value conversion parameter can be determined from the Gaussian distribution obtained by collecting statistics on a plurality of addition results of a plurality of sample data sets. In one particular implementation, a plurality of sample data sets may be obtained; for each sample data set, a plurality of exponential function values corresponding to the plurality of sample data elements are obtained through the lookup table circuit 11, and the addition result of the plurality of exponential function values, data with a bit width of N2 bits, is obtained through the adder 12. Then, the Gaussian distribution of the plurality of addition results of the plurality of sample data sets is computed, the N3 bits in which the values of the addition results are most concentrated are determined from the distribution, and the position data corresponding to those N3 bits (for example, the starting and/or ending bit number of the N3 bits) is used as the index value truncation parameter. The first conversion circuit 15 may truncate N3 consecutive bits of data from the addition result output by the adder according to the index value truncation parameter written into the static memory module in advance (for example, if the 32-bit addition result is 00000000_00000000_00000001_11000001 and the index value truncation parameter is [23, 30], the truncated data are the 8 bits from bit 23 to bit 30, counted from the most significant bit: 11100000), and the truncated data are used as the index value of the addition result.
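The offline statistics and the truncation step can be sketched as follows; the window-selection heuristic here, based on the most common leading-1 position of the sample sums, is a simplification of the Gaussian-distribution analysis described above:

```python
def choose_truncation_window(sample_sums, n3=8, width=32):
    """Pick the n3-bit window [start, start + n3 - 1] (counted from the MSB)
    starting at the most frequent leading-1 position of the sample sums."""
    counts = {}
    for s in sample_sums:
        start = width - s.bit_length()     # number of leading zeros
        counts[start] = counts.get(start, 0) + 1
    start = max(counts, key=counts.get)
    return (start, start + n3 - 1)

def truncate_index(add_result, window, width=32):
    """Extract the bits of `window` (MSB-first numbering) as the index value."""
    start, end = window
    return (add_result >> (width - 1 - end)) & ((1 << (end - start + 1)) - 1)
```

Applied to the example above, sums with nine significant bits yield the window [23, 30], and truncating 0b1_1100_0001 with that window gives 0b11100000.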
The lookup table circuit 11 is configured to output, in response to the index values of the respective data elements in the data set, a plurality of exponential function values corresponding to the plurality of data elements based on the first lookup table.
An adder 12 for outputting an addition result, which is obtained by adding a plurality of exponential function values, to the lookup table circuit 11.
The first conversion circuit 15 is configured to convert the addition result into an index value with a bit width of N3 bits, that is, an index value of the second lookup table, based on the index value conversion parameter stored in the storage module 10.
The lookup table circuit 11 is further configured to output an inverse corresponding to the addition result based on the second lookup table in response to the N3-bit index value of the addition result.
The second lookup table corresponds to the index value conversion parameter, and after the index value conversion parameter is determined in the off-line manner, the second lookup table corresponding to the index value conversion parameter can be determined, so that the second lookup table can be written into the storage module 10 through the compiler, or loaded into the storage module 10 after being powered on.
A multiplier 13 for outputting the multiplication result of the exponential function value of the i-th data element among the plurality of data elements and the reciprocal of the addition result, to obtain the softmax value of the i-th data element.
In this embodiment, the index value conversion parameter is predetermined offline, the addition result is converted into the corresponding index value based on the index value conversion parameter, and the reciprocal corresponding to the index value of the addition result is then output based on the second lookup table. Because the index value conversion parameter is predetermined, the process of determining it at run time and selecting the second lookup table from a plurality of alternatives is avoided, and the alternative second lookup tables need not all be stored; the data processing amount can thus be reduced, the response speed of the circuit improved, and the required hardware resources and power consumption reduced.
Fig. 7 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application.
Referring to fig. 7, a hardware acceleration circuit 700 includes a memory module 10, a look-up table circuit 11, an adder 12, a multiplier 13, and a first conversion circuit 15.
The memory module 10 includes a first memory area 10A and a second memory area 10B, a first lookup table is stored in the first memory area 10A, and a second lookup table is stored in the second memory area 10B.
The look-up table circuit 11 comprises a first basic look-up table circuit unit 117 and a second basic look-up table circuit unit 118.
A first basic lookup table circuit unit 117 including a first input terminal group 1171, a first control terminal group 1172, a first output terminal group 1173, and a first logic circuit 1174, the first input terminal group 1171 being connected to the first memory region 10A; the first logic 1174 is configured to: in response to the index value of the data element input from the first control terminal group 1172, the corresponding index function value stored in the first memory area 10A is output from the first output terminal group 1173.
The second basic lookup table circuit unit 118 includes a second input terminal group 1181, a second control terminal group 1182, a second output terminal group 1183, and a second logic circuit 1184, where the second input terminal group 1181 is connected to the second storage area 10B. The second logic circuit 1184 is configured to output, from the second output terminal group 1183, the corresponding reciprocal stored in the second storage area 10B in response to the index value of the addition result input from the second control terminal group 1182. The first basic lookup table circuit unit is N0-input, N1-output, and the second basic lookup table circuit unit is N3-input, N4-output; the values of N0, N1, N3 and N4 are in the range [8, 32], and in some specific examples in the range [8, 12].
In one embodiment, the first control terminal group 1172 sequentially inputs the index values of the plurality of data elements in the data set to the first logic circuit 1174; the first logic circuit 1174 outputs the corresponding exponential function value from the first output terminal group 1173 in response to each index value.
Adder 12 adds a plurality of exponent function values corresponding to the plurality of data elements output from first output terminal group 1173, thereby obtaining an addition result of the plurality of exponent function values.
The first conversion circuit 15 is configured to convert the addition result into an index value with the corresponding bit width of N3 bits.
The second control terminal group 1182 inputs the index value of the addition result to the second basic lookup table circuit unit 118; the second logic circuit 1184 outputs a corresponding reciprocal from the second output terminal group 1183 in response to the index value of the addition result input from the second control terminal group 1182.
The multiplier 13 multiplies the exponential function value corresponding to the i-th data element output by the first output terminal group 1173 and the reciprocal corresponding to the addition result output by the second output terminal group 1183, so that the obtained multiplication result is the softmax value corresponding to the i-th data element.
Fig. 8 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application.
Referring to fig. 8, a hardware acceleration circuit 800 includes a memory module 10, a look-up table circuit 11, an adder 12, a multiplier 13, and a first conversion circuit 15.
This embodiment is similar to the hardware acceleration circuit 500 shown in fig. 5, except that:
The look-up table circuit 11 comprises a first basic look-up table circuit unit 117.
The first basic lookup table circuit unit 117 includes a first input terminal group 1171, a first control terminal group 1172, a first output terminal group 1173, and a first logic circuit 1174; the first input terminal group 1171 is connected to the memory module 10, and the first logic circuit is configured to: in a first period, in response to the index value of the i-th data element input from the first control terminal group 1172, output the exponential function value corresponding to the i-th data element from the first output terminal group 1173 based on the first lookup table; and in a second period after the first period, in response to the index value of the addition result input from the first control terminal group 1172, output the reciprocal corresponding to the addition result from the first output terminal group 1173 based on the second lookup table.
In one embodiment, the memory module 10 includes a first memory area, and the first lookup table and the second lookup table are stored in the first memory area in a time-sharing manner. Because only one storage area is needed to store any one of the first lookup table and the second lookup table in a time-sharing way, the storage space occupied by the lookup table is effectively reduced, and the hardware cost can be reduced.
In another embodiment, the memory module 10 includes a first memory area and a second memory area, the first lookup table being stored in the first memory area and the second lookup table being stored in the second memory area.
In one specific implementation, N3 is equal to N0 and N4 is equal to N1, i.e. the first lookup table and the second lookup table are both N0-input, N1-output, and the first basic lookup table circuit unit 117 is fixed in the N0-input, N1-output state. In another embodiment, the first lookup table is N0-input, N1-output, the second lookup table is N3-input, N4-output, and N3 is not equal to N0 and/or N4 is not equal to N1, i.e. at least one of the pairs (N0, N3) and (N1, N4) is unequal. The first basic lookup table circuit unit 117 then further includes a state control terminal group 1175 for inputting a first state control signal in the first period and a second state control signal in the second period, so as to configure the first basic lookup table circuit unit 117 to be N0-input, N1-output in the first period and N3-input, N4-output in the second period.
It will be appreciated that this embodiment further includes a first selector 30 and a second selector 32. The first selector 30 is configured to route the exponential function value corresponding to a data element output by the first output terminal group 1173 to the adder 12, and to route the reciprocal corresponding to the addition result output by the first output terminal group 1173 to the multiplier 13. The second selector 32 is configured to selectively input either the index value of a data element or the index value of the addition result output by the first conversion circuit 15 to the first logic circuit 1174.
In this embodiment, only one basic lookup table circuit unit needs to be configured through multiplexing of the basic lookup table circuit units, so that the area and cost of the lookup table circuit can be effectively reduced.
Embodiments of data processing acceleration methods are also provided.
Fig. 9 is a flow chart of a data processing acceleration method according to an embodiment of the present application.
Referring to fig. 9, a data processing acceleration method includes:
in step S910, a plurality of exponential function values corresponding to the plurality of data elements in the data set are obtained based on the first lookup table.
In step S920, the addition result of the plurality of exponential function values is obtained.
In step S930, the reciprocal corresponding to the addition result is obtained based on the second lookup table.
In step S940, the multiplication result of the exponential function value of the i-th data element of the plurality of data elements and the reciprocal corresponding to the addition result is obtained, to obtain the softmax value of the i-th data element.
Fig. 10 is a flowchart of a data processing acceleration method according to another embodiment of the present application.
Referring to fig. 10, the data processing acceleration method of the present embodiment includes:
in step S1010, the maximum value is subtracted from each of a plurality of initial data to obtain a data set.
The maximum value of the plurality of initial data in the initial data set can be obtained, and the maximum value can be subtracted from each of the plurality of initial data by a subtractor, to obtain a data set containing a plurality of data elements.
By the above subtraction, the value of each data element in the data set is negative or 0, so that the exponential function value of the data element with base e can be normalized into the range (0, 1].
In step S1020, each data element in the data set is converted into an index value of the first lookup table.
Each data element in the data set may be converted into an index value of the first lookup table by a third conversion unit.
By this conversion, a data element that is negative or 0 is converted into an index value of the first lookup table; the index value is a fixed-point integer with a bit width of N0 bits.
In step S1030, a plurality of exponential function values corresponding to the plurality of data elements are obtained based on the first lookup table and the index values.
The exponential function values corresponding to the data elements in the data set can be obtained by the lookup table module based on the first lookup table and the index values.
It may be understood that the table lookup process for the plurality of data elements may be parallel, i.e. the lookup table module is a multiple-input multiple-output module, the index values of the plurality of data elements are input to the lookup table module in parallel, and the lookup table module outputs the corresponding exponential function values in parallel; or the lookup process may be serial, i.e. the index values of the plurality of data elements are sequentially input to the lookup table module, and the lookup table module sequentially outputs the exponential function value of each data element.
The exponential function value corresponding to a data element may be obtained by table lookup; the exponential function value may be a fixed-point integer having a bit width of N1 bits.
In step S1040, the addition result of the plurality of exponent function values is obtained.
The addition result of the plurality of exponent function values may be obtained by an adder. The addition result output by the adder is a fixed-point integer with a bit width of N2 bits.
In step S1050, the addition result is converted into an index value based on the index value conversion parameter.
The addition result is converted from data with a bit width of N2 bits to an index value with a bit width of N3 bits, that is, an index value of the second lookup table, based on a preset index value conversion parameter, by the first conversion circuit, wherein N3 is smaller than N2.
In this embodiment, the index value conversion parameter may be determined in an offline manner.
In one particular implementation, a plurality of sample data sets may be obtained; for each sample data set, a plurality of exponential function values corresponding to the plurality of sample data elements are obtained through the lookup table circuit, and the addition result of the plurality of exponential function values, data with a bit width of N2 bits, is obtained through the adder. Then, the Gaussian distribution of the plurality of addition results of the plurality of sample data sets is computed, the N3 bits in which the values of the addition results are most concentrated are determined from the distribution, and the position data corresponding to those N3 bits (for example, the starting and/or ending bit number of the N3 bits) is used as the index value truncation parameter. The first conversion circuit 15 may truncate N3 consecutive bits of data from the addition result output by the adder according to the index value truncation parameter written into the static memory module in advance (for example, if the 32-bit addition result is 00000000_00000000_00000001_11000001 and the index value truncation parameter is [23, 30], the truncated data are the 8 bits from bit 23 to bit 30, counted from the most significant bit: 11100000), and the truncated data are used as the index value of the addition result.
The second lookup table corresponds to the index value conversion parameter, and the second lookup table corresponding to the index value conversion parameter can be determined after the index value conversion parameter is determined in the off-line mode.
The second lookup table may be written into ROM by a compiler, or may be loaded into RAM after the circuit is powered up.
In step S1060, the reciprocal corresponding to the addition result is obtained based on the second lookup table and the index value of the addition result.
The reciprocal corresponding to the addition result may be obtained by the lookup table module based on the second lookup table and the index value of the addition result.
That is, the reciprocal of the addition result of the plurality of exponential function values is obtained by table lookup; the reciprocal may be a fixed-point integer with a bit width of N4 bits.
In step S1070, a multiplication result of the exponent function value of the i-th data element among the plurality of data elements and the reciprocal corresponding to the addition result is obtained.
The multiplication result of the exponent function value of the ith data element of the plurality of data elements and the reciprocal corresponding to the addition result may be obtained by the multiplier. The multiplication result may be a fixed point integer with a bit width of N5 bits.
In some embodiments, after obtaining the multiplication result, the second conversion circuit converts the multiplication result from data with a bit width of N5 bits to data with a bit width of N6 bits based on the index value conversion parameter, where N5 is greater than N6.
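The steps above amount to a short fixed-point datapath: exponential lookup via the first table, accumulation by the adder, index interception by the first conversion circuit, reciprocal lookup via the second table, and a final multiplication. The Python sketch below models that flow; the table contents, the Q8 fixed-point format (256 represents 1.0), and all names are illustrative assumptions rather than the patent's actual tables or bit widths.

```python
FRAC_BITS = 8  # assumed fixed-point fraction bits for the toy model

def softmax_lut_model(elements, exp_table, recip_table, intercept):
    """Toy model of the datapath: each data element indexes the first
    (exponential) lookup table, the adder sums the results, the first
    conversion circuit intercepts an index into the second (reciprocal)
    lookup table, and the multiplier forms exp(x_i) * (1 / sum)."""
    exps = [exp_table[e] for e in elements]   # first LUT, N1-bit values
    total = sum(exps)                         # adder, N2-bit result
    idx = intercept(total)                    # first conversion circuit
    recip = recip_table[idx]                  # second LUT, N4-bit value
    # multiplier output, rescaled back to FRAC_BITS fractional bits
    return [(e * recip) >> FRAC_BITS for e in exps]

# Two-element demo in Q8: e^0 = 1.0 -> 256, e^-1 ~ 0.367 -> 94,
# and 1/1.367 ~ 0.731 -> 187. The identity intercept keeps the demo small.
exp_table = {0: 256, 1: 94}
recip_table = {350: 187}
out = softmax_lut_model([0, 1], exp_table, recip_table, intercept=lambda s: s)
print(out)  # → [187, 68], i.e. roughly 0.73 and 0.27 in Q8
```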
Fig. 11 is a flowchart of a data processing acceleration method according to another embodiment of the present application.
Referring to fig. 11, the data processing acceleration method of the present embodiment includes:
In step S1110, the maximum value of a plurality of initial data is subtracted from each of the initial data to obtain a data set.
The maximum value of the plurality of initial data in the initial data set may be obtained, and a subtracter may subtract the maximum value from each of the plurality of initial data, so as to obtain a data set containing a plurality of data elements.
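Subtracting the maximum first is the standard numerical-stability transformation for softmax: it leaves the final result unchanged while making every exponent argument non-positive, so that e^x lies in (0, 1] and fits comfortably in a fixed-point lookup table. A minimal floating-point check (plain Python, not the fixed-point hardware path) illustrates the invariance:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [3.0, 1.0, 0.5]
m = max(xs)
shifted = [x - m for x in xs]           # every element is now <= 0
direct = softmax(xs)
stable = softmax(shifted)
print(all(abs(a - b) < 1e-12 for a, b in zip(direct, stable)))  # → True
```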
In step S1120, each data element in the data set is converted into an index value of the first lookup table.
Each data element in the data set may be converted into an index value of the first lookup table by a third conversion unit.
In step S1130, a plurality of exponential function values corresponding to the plurality of data elements are obtained based on the first lookup table and the index values.
The plurality of exponential function values corresponding to the plurality of data elements in the data set may be obtained by the lookup table module based on the first lookup table and the index values.
In this embodiment, the exponential function value (with e as the base) of each data element is obtained by table lookup. It may be understood that the table lookup for the data elements may be a parallel process: the lookup table module is a multiple-input multiple-output module, the index values of the data elements are input to it in parallel, and it outputs the corresponding exponential function values in parallel. Alternatively, the table lookup may be a serial process: the index values of the data elements are input to the lookup table module in sequence, and the module outputs the exponential function value of each data element in sequence.
In step S1140, the addition result of the plurality of exponential function values is obtained, and the index value conversion parameter is determined based on the addition result.
The addition result of the plurality of exponential function values may be obtained by the adder, and the index value conversion parameter may be determined based on that addition result.
In step S1150, a selected second lookup table corresponding to the index value conversion parameter is determined from a plurality of alternative second lookup tables.
The selected second lookup table corresponding to the index value conversion parameter may be determined from a plurality of alternative second lookup tables by a second lookup table determination module.
In one embodiment, the compiler writes a plurality of alternative second lookup tables into the static storage module, and after determining the selected second lookup table, the selected second lookup table may be loaded into the dynamic storage module.
In step S1160, the addition result is converted into a corresponding index value based on the index value conversion parameter.
The addition result may be converted into a corresponding index value based on the index value conversion parameter by the first conversion circuit.
In step S1170, the reciprocal corresponding to the addition result is obtained based on the selected second lookup table and the index value of the addition result.
The reciprocal corresponding to the addition result may be obtained by the lookup table module based on the selected second lookup table and the index value of the addition result.
In step S1180, a multiplication result of the exponent function value of the i-th data element of the plurality of data elements and the reciprocal corresponding to the addition result is obtained.
The multiplication result of the exponent function value of the ith data element of the plurality of data elements and the reciprocal corresponding to the addition result may be obtained by the multiplier.
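The embodiment of Fig. 11 determines the index value conversion parameter online from the addition result itself and then selects a matching table among the alternative second lookup tables. The patent does not commit to a specific determination rule, so the rule below, keying the N3-bit interception window to the position of the most significant set bit of the addition result, is purely an illustrative assumption, as are all the names.

```python
def choose_conversion_parameter(addition_result: int, n3: int = 8, width: int = 32):
    """Illustrative online rule: place the N3-bit interception window
    starting at the most significant set bit of the addition result.
    Returns (start, end) positions counted from the MSB, 0-indexed."""
    msb_from_lsb = addition_result.bit_length() - 1   # MSB position from LSB
    start = width - 1 - msb_from_lsb                  # same position from MSB
    end = min(start + n3 - 1, width - 1)
    return (start, end)

# Each alternative second lookup table would be precomputed for one
# possible window; the parameter selects which table to use.
alternative_tables = {(23, 30): "table_A", (15, 22): "table_B"}
param = choose_conversion_parameter(0b111000001)      # addition result = 449
print(param, alternative_tables.get(param))  # → (23, 30) table_A
```

A compiler could then, as in step S1150, write all candidate tables into the static storage module and load only the selected one into the dynamic storage module.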
For the relevant features of the data processing acceleration method of this embodiment, reference may be made to the corresponding content of the foregoing hardware acceleration circuit embodiments, which is not repeated here.
The data processing acceleration method according to the embodiment of the application can be applied to an artificial intelligent accelerator. FIG. 12 is a schematic diagram of an artificial intelligence accelerator according to an embodiment of the present application. Referring to fig. 12, an artificial intelligence accelerator 1200 includes a memory 1210 and a processor 1220.
The processor 1220 may be a general-purpose processor such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. The artificial intelligence operations may include machine learning operations, brain-like operations, and the like; the machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. The artificial intelligence processor may include, for example, one of, or a combination of, a GPU (Graphics Processing Unit), a DLA (Deep Learning Accelerator), an NPU (Neural-network Processing Unit), a DSP (Digital Signal Processor), a field-programmable gate array (FPGA), and an application-specific integrated circuit (ASIC). The specific type of the processor is not limited by the present application.
Memory 1210 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1220 or other modules of the computer. The persistent storage may be a readable and writable storage device, that is, a non-volatile device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage; in other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory; it may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 1210 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some implementations, memory 1210 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, a super-density disc, a flash memory card (e.g., an SD card, a mini SD card, or a micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
Memory 1210 has stored thereon executable code that, when processed by processor 1220, causes processor 1220 to perform some or all of the methods described above.
In one possible implementation, an artificial intelligence accelerator may include multiple processors, each of which may independently run various tasks assigned thereto. The present application is not limited to the processor and the tasks that the processor operates.
It should be understood that, unless otherwise specified, each functional unit/module in the embodiments of the present application may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules described above may be implemented either in hardware or in software program modules.
The integrated units/modules, if implemented in hardware, may be digital circuits, analog circuits, and the like. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the memory module may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), or a hybrid memory cube (HMC).
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
In one possible implementation, an artificial intelligence chip is also disclosed that includes the hardware acceleration circuit described above.
In one possible implementation, a board is also disclosed, which includes a memory device, an interface device, and a control device, and the artificial intelligence chip described above; wherein the artificial intelligent chip is respectively connected with the storage device, the control device and the interface device; the storage device is used for storing data; the interface device is used for realizing data transmission between the artificial intelligent chip and external equipment; the control device is used for monitoring the state of the artificial intelligent chip.
In one possible implementation, an electronic device is disclosed that includes the artificial intelligence chip described above. The electronic device may be a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device. The vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (20)
1. A hardware acceleration circuit, comprising:
the storage module is used for storing the first lookup table and the second lookup table;
a lookup table circuit for outputting, based on the first lookup table, a plurality of exponential function values corresponding to a plurality of data elements in a data set in response to index values of the plurality of data elements; and outputting, based on the second lookup table, a reciprocal corresponding to an addition result in response to an index value of the addition result;
an adder configured to output the addition result, obtained by adding the plurality of exponential function values, to the lookup table circuit;
and a multiplier for outputting a multiplication result of the exponential function value of an i-th data element in the plurality of data elements and the reciprocal, so as to obtain a softmax value of the i-th data element.
2. The hardware acceleration circuit of claim 1, wherein:
the exponential function value is data with a bit width of N1 bits, the addition operation result is data with a bit width of N2 bits, the index value of the addition operation result is data with a bit width of N3 bits, and N1 and N3 are smaller than N2;
the hardware acceleration circuit further includes: and the first conversion circuit is used for converting the addition operation result into an index value of the addition operation result based on the index value conversion parameter.
3. The hardware acceleration circuit of claim 2, wherein:
the storage module comprises a static storage module;
the index value conversion parameters are stored in the static storage module;
the index value conversion parameter is determined from Gaussian distribution data obtained by statistically analyzing a plurality of addition operation results of a plurality of sample data sets.
4. The hardware acceleration circuit of claim 2, further comprising:
An index value conversion parameter acquisition circuit for determining and outputting the index value conversion parameter based on the addition operation result;
the first conversion circuit is used for converting the addition operation result into a corresponding index value based on the index value conversion parameter;
the lookup table circuit outputs the reciprocal corresponding to the addition operation result based on the second lookup table, specifically: outputting the reciprocal corresponding to the addition operation result based on a selected second lookup table, the selected second lookup table being the one of a plurality of alternative second lookup tables that corresponds to the index value conversion parameter.
5. The hardware acceleration circuit of claim 4, wherein:
the storage module comprises a static storage module, and the plurality of alternative second lookup tables are stored in the static storage module; or,
the memory module includes a dynamic memory module, and the selected second lookup table is stored in the dynamic memory module.
6. The hardware acceleration circuit of claim 1, wherein:
the storage module comprises a first storage area and a second storage area, the first lookup table is stored in the first storage area, and the second lookup table is stored in the second storage area;
The look-up table circuit includes:
the first basic lookup table circuit unit comprises a first logic circuit, a first input end group, a first control end group and a first output end group, wherein the first input end group is connected with the first storage area; the first logic circuit is configured to: outputting a corresponding index function value from the first output terminal group in response to an index value of an i-th data element in the data set input from the first control terminal group;
the second basic lookup table circuit unit comprises a second logic circuit, a second input end group, a second control end group and a second output end group, wherein the second input end group is connected with the second storage area; the second logic circuit is configured to: outputting a corresponding reciprocal from the second output terminal group in response to the index value of the addition result input from the second control terminal group;
wherein: the first basic lookup table circuit unit is N0 input N1 output, the second basic lookup table circuit unit is N3 input N4 output, and the value range of N0, N1, N3 and N4 is [8, 32].
7. The hardware acceleration circuit of claim 1, wherein:
the storage module comprises a first storage area and a second storage area, wherein the first storage area is used for storing the first lookup table and the second lookup table in a time-sharing mode; or the storage module comprises a first storage area and a second storage area, wherein the first lookup table is stored in the first storage area, and the second lookup table is stored in the second storage area;
The look-up table circuit includes:
the first basic lookup table circuit unit comprises a first logic circuit, a first input terminal group, a first control terminal group, and a first output terminal group, wherein the first input terminal group is connected with the storage module, and the first logic circuit is configured to: in a first time period, in response to the index value of an i-th data element input from the first input terminal group, output the exponential function value corresponding to the i-th data element from the first output terminal group based on the first lookup table; and in a second time period after the first time period, in response to the index value of the addition result input from the first input terminal group, output the reciprocal corresponding to the addition result from the first output terminal group based on the second lookup table;
wherein:
the first basic lookup table circuit unit is in an N0 input N1 output state; or,
the first basic lookup table circuit unit further comprises a state control terminal group, wherein the state control terminal group is used for inputting a first state control signal in the first time period and a second state control signal in the second time period, so as to configure the basic lookup table circuit unit to be in an N0 input N1 output state in the first time period and in an N3 input N4 output state in the second time period; wherein at least one of the pairs N0 and N3, and N1 and N4, is unequal.
8. The hardware acceleration circuit of claim 2, further comprising:
and the second conversion circuit is used for converting the multiplication result from data with a bit width of N4 bits into a softmax value with a bit width of N5 bits based on the index value conversion parameter, wherein N4 is greater than N5.
9. The hardware acceleration circuit of claim 2, further comprising:
a subtractor for outputting a subtraction result of a plurality of initial data in an initial data set and a maximum value of the plurality of initial data to obtain the data set containing the plurality of data elements;
and the third conversion unit is used for converting the plurality of data elements into a plurality of index values corresponding to the first lookup table.
10. The hardware acceleration circuit of any one of claims 1 to 9, wherein:
the exponential function value, the addition result, the multiplication result, and the reciprocal of the addition result are fixed-point integers.
11. An artificial intelligence chip, characterized in that the chip comprises a hardware acceleration circuit according to any one of claims 1 to 10.
12. A data processing acceleration method, characterized by comprising:
Obtaining a plurality of exponential function values corresponding to a plurality of data elements in the data set based on the first lookup table;
obtaining an addition operation result of the plurality of exponential function values;
obtaining a reciprocal corresponding to the addition operation result based on a second lookup table;
obtaining a multiplication operation result of the exponential function value of an i-th data element in the plurality of data elements and the reciprocal corresponding to the addition operation result, so as to obtain a softmax value of the i-th data element.
13. The method of claim 12, wherein the obtaining of the reciprocal corresponding to the addition operation result based on the second lookup table comprises:
converting the addition operation result from data with the bit width of N2 bits into an index value with the bit width of N3 bits based on index value conversion parameters, wherein N3 is smaller than N2;
and obtaining the reciprocal corresponding to the addition result based on the second lookup table and the index value of the addition result.
14. The method as recited in claim 13, further comprising:
writing the index value conversion parameters into a static storage module through a compiler;
the index value conversion parameter is determined from Gaussian distribution data obtained by statistically analyzing a plurality of addition operation results of a plurality of sample data sets.
15. The method as recited in claim 13, further comprising:
determining the index value conversion parameter based on the addition operation result;
the obtaining the reciprocal corresponding to the addition result based on the second lookup table and the index value of the addition result includes:
determining a selected second lookup table corresponding to the index value conversion parameter from a plurality of alternative second lookup tables;
and obtaining the reciprocal corresponding to the addition result based on the selected second lookup table and the index value of the addition result.
16. The method as recited in claim 15, further comprising:
writing the plurality of alternative second lookup tables into a static storage module through a compiler; and/or the number of the groups of groups,
and loading the selected second lookup table to a dynamic storage module.
17. The method of claim 12, wherein:
before the obtaining of the plurality of exponential function values corresponding to the plurality of data elements in the data set based on the first lookup table, the method further includes: subtracting the maximum value of a plurality of initial data in an initial data set from each of the plurality of initial data to obtain the data set comprising the plurality of data elements; and/or,
the obtaining of the multiplication result of the exponential function value of the i-th data element in the plurality of data elements and the reciprocal further includes: converting the multiplication result from data with a bit width of N4 bits into data with a bit width of N5 bits based on the index value conversion parameter, wherein N4 is greater than N5.
18. The method as recited in claim 12, wherein:
the exponential function value, the addition operation result, the multiplication operation result, and the reciprocal of the addition operation result are fixed-point integers;
the bit widths of the exponential function value and the reciprocal are in the range [8, 32].
19. The method according to any one of claims 12 to 18, wherein the method is used for implementing a softmax function layer of a neural network for classifying data to be processed; wherein
the data to be processed includes at least one of voice data, text data, and image data.
20. An artificial intelligence accelerator, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 12-19.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111557307.8A CN116306826A (en) | 2021-12-18 | 2021-12-18 | Hardware acceleration circuit, data processing acceleration method, chip and accelerator |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111557307.8A CN116306826A (en) | 2021-12-18 | 2021-12-18 | Hardware acceleration circuit, data processing acceleration method, chip and accelerator |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116306826A true CN116306826A (en) | 2023-06-23 |
Family
ID=86798245
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111557307.8A Pending CN116306826A (en) | 2021-12-18 | 2021-12-18 | Hardware acceleration circuit, data processing acceleration method, chip and accelerator |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116306826A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119396845A (en) * | 2025-01-06 | 2025-02-07 | 深圳鲲云信息科技有限公司 | Hardware lookup table for artificial intelligence chip, method for loading data and computing device |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101589610A (en) * | 2007-01-25 | 2009-11-25 | 高通Mems科技公司 | Arbitrary power function using logarithm lookup table |
| CN109308520A (en) * | 2018-09-26 | 2019-02-05 | 阿里巴巴集团控股有限公司 | Realize the FPGA circuitry and method that softmax function calculates |
| CN109669962A (en) * | 2017-10-15 | 2019-04-23 | Gsi 科技公司 | The index of precision and accurate SOFTMAX are calculated |
| US20190325309A1 (en) * | 2017-08-19 | 2019-10-24 | Wave Computing, Inc. | Neural network output layer for machine learning |
| US10949498B1 (en) * | 2019-03-13 | 2021-03-16 | Xlnx, Inc. | Softmax circuit |
| CN112668691A (en) * | 2019-10-16 | 2021-04-16 | 三星电子株式会社 | Method and device with data processing |
| CN113407747A (en) * | 2020-03-17 | 2021-09-17 | 三星电子株式会社 | Hardware accelerator execution method, hardware accelerator and neural network device |
2021
- 2021-12-18: CN application CN202111557307.8A filed (publication CN116306826A), status: Pending
Non-Patent Citations (2)
| Title |
|---|
| XIAO DONG ET AL: "Hardware Implementation of Softmax Function Based on Piecewise LUT", 2019 IEEE International Workshop on Future Computing, 27 April 2020 (2020-04-27), pages 1 - 3 *
| SUN QIWEI: "Implementing a SoftMax Function Accelerator on PYNQ", 《集成电路》 (Integrated Circuits), no. 6, 30 June 2019 (2019-06-30), pages 69 - 73 *
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12518164B2 (en) | Processing method and accelerating device | |
| US11977968B2 (en) | Sparse processing in neural network processors | |
| US11656910B2 (en) | Data sharing system and data sharing method therefor | |
| CN111931925B (en) | Acceleration system of binary neural network based on FPGA | |
| CN110647981B (en) | Data processing method, data processing device, computer equipment and storage medium | |
| CN117494816B (en) | Model reasoning method, device, equipment and medium based on computing unit deployment | |
| CN110647722B (en) | Data processing method and device and related products | |
| CN113298843B (en) | Data quantization processing method, device, electronic equipment and storage medium | |
| TW202248874A (en) | Multiply-accumulate device and multiply-accumulate method | |
| CN109240644A (en) | A kind of local search approach and circuit for Yi Xin chip | |
| CN113408716B (en) | Computing device, method, board and computer readable storage medium | |
| CN116306826A (en) | Hardware acceleration circuit, data processing acceleration method, chip and accelerator | |
| CN110490317B (en) | Neural network operation device and operation method | |
| CN112446472B (en) | Methods, apparatus, and related products for processing data | |
| CN109389213B (en) | Storage device and method, data processing device and method, electronic device | |
| CN111353124A (en) | Computing method, apparatus, computer equipment and storage medium | |
| CN116306827A (en) | Hardware acceleration circuit, chip, data processing acceleration method, accelerator and equipment | |
| CN116306833A (en) | Hardware acceleration circuit, data processing acceleration method, chip and accelerator | |
| US20250156180A1 (en) | Hardware acceleration circuit, data processing acceleration method, chip, and accelerator | |
| CN116306825A (en) | Hardware acceleration circuit, data processing acceleration method, chip and accelerator | |
| CN112801276B (en) | Data processing method, processor and electronic device | |
| CN117391159A (en) | Hardware acceleration circuit, data processing acceleration method, chip and accelerator | |
| CN117391157A (en) | Hardware acceleration circuit, data processing acceleration method, chip and accelerator | |
| CN113626082B (en) | Data processing method, device and related products | |
| US20250156151A1 (en) | Hardware acceleration circuit, data processing acceleration method, chip, and accelerator |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |