CN114586019A - memory-based processor - Google Patents

memory-based processor

Info

Publication number
CN114586019A
Authority
CN
China
Prior art keywords
memory
processing
processor
database
integrated circuit
Prior art date
Legal status
Pending
Application number
CN202080071415.1A
Other languages
Chinese (zh)
Inventor
E.西蒂
E.希勒尔
S.布劳多
D.沙米尔
G.达扬
Current Assignee
NeuroBlade Ltd
Original Assignee
NeuroBlade Ltd
Priority date
Filing date
Publication date
Application filed by NeuroBlade Ltd
Publication of CN114586019A

Classifications

    • G06F 15/7821: System on chip tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • G06F 12/0215: Addressing or allocation; relocation with look-ahead addressing means
    • G06F 12/1441: Protection against unauthorised use of memory or access to memory, the protection being physical, e.g. cell, word, block, for a range
    • G06F 21/78: Protecting specific internal or peripheral components to assure secure storage of data
    • G11C 7/1006: Input/output [I/O] data interface arrangements; data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G06F 12/0811: Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G06F 12/0813: Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
    • G06F 12/0862: Addressing of a memory level in which access requires associative addressing means, e.g. caches, with prefetch
    • G06F 2212/454: Caching of specific data in cache memory; vector or matrix data
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Software Systems (AREA)
  • Dram (AREA)
  • Memory System (AREA)
  • Storage Device Security (AREA)
  • Semiconductor Memories (AREA)

Abstract

In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, wherein the memory array includes a plurality of discrete memory banks. The integrated circuit can also include a processing array disposed on the substrate, wherein the processing array includes a plurality of processor sub-units, each of the plurality of processor sub-units being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.

Description

MEMORY-BASED PROCESSOR

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to: U.S. Provisional Application No. 62/886,328, filed August 13, 2019; U.S. Provisional Application No. 62/907,659, filed September 29, 2019; U.S. Provisional Application No. 62/971,912, filed February 7, 2020; and U.S. Provisional Application No. 62/983,174, filed February 28, 2020. The foregoing applications are incorporated herein by reference in their entirety.

TECHNICAL FIELD

The present disclosure relates generally to apparatuses for facilitating memory-intensive operations. In particular, the present disclosure relates to hardware chips that include processing elements coupled to dedicated memory banks. The present disclosure also relates to apparatuses for improving the power efficiency and speed of memory chips, and in particular to systems and methods for implementing partial refresh, or even no refresh, on memory chips. The present disclosure further relates to size-selectable memory chips and to dual-port capabilities on memory chips.

BACKGROUND

As both processor speeds and memory sizes continue to increase, a significant limit on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from the throughput limitations of conventional computer architectures. In particular, the transfer of data from memory to the processor is often a bottleneck compared to the actual computations performed by the processor. Accordingly, the number of clock cycles spent reading from and writing to memory increases significantly for memory-intensive processes. These clock cycles lower the effective processing speed, because reading from and writing to memory consumes clock cycles that cannot be used to perform operations on the data. Moreover, the computational bandwidth of a processor is generally greater than the bandwidth of the buses that the processor uses to access the memory.

These bottlenecks are particularly pronounced for memory-intensive processes such as neural networks and other machine learning algorithms; database construction, index searching, and querying; and other tasks that involve more read and write operations than data-processing operations.

Additionally, the rapid growth in the volume and granularity of available digital data has created opportunities to develop machine learning algorithms and has enabled new technologies. However, it has also brought difficult challenges to the fields of databases and parallel computing. For example, the rise of social media and the Internet of Things (IoT) generates digital data at record rates. This new data can be used to create algorithms for a variety of purposes, ranging from new advertising techniques to more precise control methods for industrial processes. However, the new data is difficult to store, process, analyze, and handle.

New data resources can be enormous, sometimes on the order of petabytes to zettabytes. Moreover, the growth rate of these data resources may exceed data processing capacity. Data scientists have therefore turned to parallel data processing techniques to address these challenges. In an effort to increase computing power and handle the massive amounts of data, scientists have attempted to create systems and methods capable of intensive parallel computation. But these existing systems and methods have not kept pace with data processing requirements, often because the techniques used are limited by their need for additional resources for data management, integration of separated data, and analysis of sectioned data.

To facilitate the manipulation of large data sets, engineers and scientists are now seeking to improve the hardware used to analyze data. For example, new semiconductor processors or chips, such as those described herein, can be designed specifically for data-intensive tasks by incorporating memory and processing functions in a single substrate fabricated with technologies better suited to memory operations than to arithmetic computation. With integrated circuits designed specifically for data-intensive tasks, it is possible to meet the new data processing requirements. However, this new approach to data processing for large data sets requires solving new problems in chip design and fabrication. For example, if a new chip designed for data-intensive tasks were fabricated using the fabrication techniques and architectures of conventional chips, it would have poor performance and/or unacceptable yields. Furthermore, if the new chip were designed to operate with current data handling methods, it would have poor performance, because current methods can limit the chip's ability to handle parallel operations.

The present disclosure describes solutions for mitigating or overcoming one or more of the problems set forth above, as well as other problems in the prior art.

SUMMARY OF THE INVENTION

In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, wherein the memory array includes a plurality of discrete memory banks. The integrated circuit may also include a processing array disposed on the substrate, wherein the processing array includes a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may further include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.

The disclosed embodiments may also include a method of protecting an integrated circuit against tampering. The method includes implementing, using a controller associated with the integrated circuit, at least one security measure with respect to an operation of the integrated circuit, and taking one or more remedial actions if the at least one security measure is triggered, wherein the integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks.

The disclosed embodiments may include an integrated circuit comprising: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks; and a controller configured to implement at least one security measure with respect to an operation of the integrated circuit, wherein the at least one security measure includes duplicating program code in at least two different memory portions.
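
As an illustration of the code-duplication security measure described above, the following is a minimal sketch in Python (purely illustrative; class and method names such as Controller, check_and_fetch, and remedial_action are hypothetical and are not taken from the disclosure) in which a controller keeps two copies of the program code in different memory portions, treats any divergence between them as a trigger, and responds with a remedial action:

    # Illustrative model only: a controller keeps two copies of the program code
    # in different memory portions and compares them before execution.

    class TamperDetected(Exception):
        pass

    class Controller:
        def __init__(self, code: bytes):
            # Security measure: duplicate the program code in two memory portions.
            self.portion_a = bytearray(code)
            self.portion_b = bytearray(code)

        def check_and_fetch(self) -> bytes:
            # Trigger the security measure if the two copies diverge.
            if self.portion_a != self.portion_b:
                self.remedial_action()
            return bytes(self.portion_a)

        def remedial_action(self):
            # One possible remedial action: wipe both copies and halt.
            for portion in (self.portion_a, self.portion_b):
                for i in range(len(portion)):
                    portion[i] = 0
            raise TamperDetected("program code copies disagree")

    ctrl = Controller(b"\x90\x90\xc3")      # hypothetical program bytes
    ctrl.portion_b[0] ^= 0xFF               # simulate a fault-injection attack
    try:
        ctrl.check_and_fetch()
    except TamperDetected as e:
        print("remedial action taken:", e)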

In some embodiments, a distributed processor memory chip is provided that includes: a substrate; a memory array disposed on the substrate; a processing array disposed on the substrate; a first communication port; and a second communication port. The memory array may include a plurality of discrete memory banks. The processing array may include a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The first communication port may be configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip. The second communication port may be configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.

In some embodiments, a method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip may include: determining, using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip, whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and, after determining that the first processor subunit is ready to transfer data to the second processor subunit, initiating the transfer of data from the first processor subunit to the second processor subunit using a clock enable signal controlled by the controller.
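
The following is a minimal sketch in Python of the handshake described above, under the assumption that "ready" simply means the sending subunit has data buffered; the names ProcessorSubunit, ChipController, and clock_enable are hypothetical stand-ins for the subunits, controller, and clock enable signal:

    # Illustrative handshake only; signal and method names are invented for this sketch.

    class ProcessorSubunit:
        def __init__(self):
            self.tx_buffer = None
        def load(self, data):
            self.tx_buffer = data
        def ready_to_send(self) -> bool:
            return self.tx_buffer is not None

    class ChipController:
        def __init__(self):
            self.clock_enable = False
        def transfer(self, src: ProcessorSubunit, dst_chip_rx: list) -> bool:
            # Step 1: determine whether the sending subunit is ready.
            if not src.ready_to_send():
                return False
            # Step 2: assert the clock enable signal to initiate the transfer.
            self.clock_enable = True
            dst_chip_rx.append(src.tx_buffer)   # data crosses the chip-to-chip port
            src.tx_buffer = None
            self.clock_enable = False           # de-assert once the transfer completes
            return True

    src = ProcessorSubunit()
    src.load([1, 2, 3])
    rx = []
    print(ChipController().transfer(src, rx), rx)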

In some embodiments, a memory unit may include: a memory array including a plurality of memory banks; at least one controller configured to control at least one aspect of read operations relative to the plurality of memory banks; and at least one zero-value detection logic unit configured to detect a multi-bit zero value stored at a particular address of the plurality of memory banks; wherein the at least one controller and the at least one zero-value detection logic unit are configured to return a zero-value indicator to one or more circuits external to the memory unit in response to a zero-value detection by the at least one zero-value detection logic.

Some embodiments may include a method for detecting a zero value at a particular address of a plurality of discrete memory banks, comprising: receiving, from a circuit external to a memory unit, a request to read data stored at an address of the plurality of discrete memory banks; in response to the received request, activating, by a controller, a zero-value detection logic unit to detect a zero value at the received address; and, in response to a zero-value detection by the zero-value detection logic unit, transmitting, by the controller, a zero-value indicator to the circuit.

Some embodiments may include a non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value at a particular address of a plurality of discrete memory banks by: receiving, from a circuit external to the memory unit, a request to read data stored at an address of the plurality of discrete memory banks; in response to the received request, activating, by the controller, a zero-value detection logic unit to detect a zero value at the received address; and, in response to a zero-value detection by the zero-value detection logic unit, transmitting, by the controller, a zero-value indicator to the circuit.
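
A minimal sketch in Python of the zero-value detection idea described above: the memory unit answers a read request with a compact zero-value indicator instead of the full multi-bit word whenever the addressed value is zero. The class and method names are hypothetical, and a flat list stands in for the multiple discrete memory banks:

    # Illustrative only: a memory unit that can answer "is this address all zeros?"
    # with a single indicator instead of returning the full multi-bit word.

    class MemoryUnit:
        def __init__(self, words):
            self.banks = list(words)           # flat stand-in for multiple banks

        def read(self, address):
            return self.banks[address]

        def read_with_zero_detect(self, address):
            # Zero-value detection logic: check the word at the requested address.
            word = self.banks[address]
            if word == 0:
                return ("ZERO", None)          # zero-value indicator, no data payload
            return ("DATA", word)

    mem = MemoryUnit([0x00, 0xAB, 0x00, 0xFF])
    for addr in range(4):
        print(addr, mem.read_with_zero_detect(addr))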

In some embodiments, a memory unit may include: one or more memory banks; a bank controller; and an address generator; wherein the address generator is configured to provide to the bank controller a current address of a current row to be accessed in an associated memory bank, to determine a predicted address of a next row to be accessed in the associated memory bank, and to provide the predicted address to the bank controller before a read operation relative to the current row associated with the current address is completed.
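
A minimal sketch in Python of the next-row prediction described above, assuming for illustration a simple stride-1 (sequential) predictor; the disclosure does not specify the prediction policy, and the names AddressGenerator and BankController are hypothetical:

    # Illustrative only: an address generator that hands the bank controller a
    # predicted next-row address before the current read finishes, assuming a
    # simple stride-1 access pattern.

    class BankController:
        def __init__(self):
            self.open_row = None
            self.prepared_row = None
        def open(self, row) -> bool:
            # If the row was already prepared (activated early), opening it is cheap.
            hit = (row == self.prepared_row)
            self.open_row, self.prepared_row = row, None
            return hit
        def prepare(self, row):
            self.prepared_row = row            # activate the predicted row early

    class AddressGenerator:
        def __init__(self, ctrl, stride=1):
            self.ctrl, self.stride = ctrl, stride
        def access(self, row) -> bool:
            hit = self.ctrl.open(row)
            # Provide the predicted next row before the current read completes.
            self.ctrl.prepare(row + self.stride)
            return hit

    ctrl = BankController()
    gen = AddressGenerator(ctrl)
    print([gen.access(r) for r in range(5)])   # first access misses, the rest hit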

In some embodiments, a memory unit may include: one or more memory banks, wherein each of the one or more memory banks includes a plurality of rows; a first row controller configured to control a first subset of the plurality of rows; a second row controller configured to control a second subset of the plurality of rows; a single data input for receiving data to be stored in the plurality of rows; and a single data output for providing data retrieved from the plurality of rows.

In some embodiments, a distributed processor memory chip may include: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the processor subunits being associated with a corresponding dedicated memory bank among the plurality of discrete memory banks; a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding dedicated memory bank; and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits. At least one of the memory banks may include at least one DRAM memory pad disposed on the substrate. At least one of the processor subunits may include one or more logic components associated with the at least one memory pad. The at least one memory pad and the one or more logic components may be configured to act as a cache for one or more of the plurality of processing subunits.
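
A minimal sketch in Python of a memory pad plus simple logic acting as a cache for a processor subunit, as described above. The direct-mapped organization, the sizes, and the names PadCache and backing are assumptions made for this sketch only, not details from the disclosure:

    # Illustrative only: a memory pad plus simple logic used as a direct-mapped
    # cache in front of a larger memory bank.

    class PadCache:
        def __init__(self, backing, lines=4):
            self.backing = backing
            self.lines = lines
            self.tags = [None] * lines         # tag storage held in the pad
            self.data = [None] * lines         # data storage held in the pad

        def read(self, address):
            idx, tag = address % self.lines, address // self.lines
            if self.tags[idx] == tag:
                return self.data[idx], True    # cache hit: no bank access needed
            value = self.backing[address]      # cache miss: fetch from the bank
            self.tags[idx], self.data[idx] = tag, value
            return value, False

    bank = list(range(100, 132))               # stand-in for a discrete memory bank
    cache = PadCache(bank)
    print(cache.read(5))                       # miss
    print(cache.read(5))                       # hit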

In some embodiments, a method of executing at least one instruction in a distributed processor memory chip may include: retrieving one or more data values from a memory array of the distributed processor memory chip; storing the one or more data values in a register formed in a memory pad of the distributed processor memory chip; and accessing the one or more data values stored in the register in accordance with at least one instruction executed by a processor element; wherein the memory array includes a plurality of discrete memory banks disposed on a substrate; wherein the processor element is a processor subunit among a plurality of processor subunits included in a processing array disposed on the substrate, each of the processor subunits being associated with a corresponding dedicated memory bank among the plurality of discrete memory banks; and wherein the register is provided by a memory pad disposed on the substrate.

Some embodiments may include an apparatus comprising: a substrate; a processing unit disposed on the substrate; and a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and wherein the processing unit includes a memory pad configured to act as a cache for the processing unit.

Processing systems are expected to process increasing amounts of information at extremely high rates. For example, the fifth generation (5G) mobile Internet is expected to receive large numbers of information streams and to process those streams at increasing rates.

Such a processing system may include one or more buffers and a processor. The processing operations applied by the processor may have a certain latency, which may require a large number of buffers. A large number of buffers can be costly and/or area-consuming.

Transferring large amounts of information from the buffers to the processor may require high-bandwidth connectors and/or high-bandwidth buses between the buffers and the processor, which can also increase the cost and area of the processing system.

There is a growing need to provide efficient processing systems.

Processing systems are expected to process increasing amounts of information at extremely high rates. For example, the fifth generation (5G) mobile Internet is expected to receive large numbers of information streams and to process those streams at increasing rates.

Such a processing system may include one or more buffers and processors. The processing operations applied by the processor may have a certain latency, which may require a large number of buffers. A large number of buffers can be costly and/or area-consuming.

Transferring large amounts of information from the buffers to the processor may require high-bandwidth connectors and/or high-bandwidth buses between the buffers and the processor, which can also increase the cost and area of the processing system.

There is a growing need to provide efficient processing systems.

A disaggregated server includes multiple subsystems, each having a unique role. For example, a disaggregated server may include one or more switching subsystems, one or more compute subsystems, and one or more storage subsystems.

The one or more compute subsystems and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.

A compute subsystem may include multiple compute units.

A switching subsystem may include multiple switching units.

A storage subsystem may include multiple storage units.

The bottleneck of such a disaggregated server is the bandwidth required to transfer information between the subsystems.

This is especially true when performing distributed computations that require information units to be shared among all (or at least most) of the compute units, such as graphics processing units, of the different compute subsystems.

Assume that there are N compute units participating in the sharing, where N is a very large integer (for example, at least 1024), and that each of the N compute units must send information units to, and receive information units from, all of the other compute units. Under these assumptions, on the order of N×N transfer processes of information units must be executed. Such a large number of transfer processes is time-consuming and energy-consuming, and it significantly limits the throughput of the disaggregated server.
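
As a worked example of the scale involved (the value of N below is only the lower bound mentioned above):

    # With N = 1024 compute units in an all-to-all exchange, each unit sends to
    # every other unit, so sharing a single information unit already costs
    # roughly N * (N - 1) point-to-point transfers, i.e. on the order of N squared.
    N = 1024
    print(N * (N - 1))    # 1047552 transfers for one shared information unit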

There is a growing need to provide efficient disaggregated servers and efficient ways of performing distributed processing.

A database includes many entries, and the entries include multiple fields. Database processing typically involves executing one or more queries that include one or more filter parameters (for example, identifying one or more relevant fields and one or more relevant field values) and one or more operation parameters, which may determine the type of operation to be executed, variables or constants to be used when applying the operation, and the like.

For example, a database query may request that a statistical operation (an operation parameter) be executed on all records of the database in which a certain field has a value within a predefined range (a filter parameter). As another example, a database query may request the deletion (an operation parameter) of records in which a certain field is smaller than a threshold (a filter parameter).

Large databases are typically stored in a storage device. In order to respond to a query, the database is sent to a memory unit, usually one database section after another.

The entries of a database section are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.

For each database section of the database stored in the memory unit, the processing includes the following steps: (i) selecting the records of the database section; (ii) sending the records from the memory unit to the processor; (iii) filtering the records, by the processor, to determine whether the records are relevant; and (iv) performing one or more additional operations (summing, or applying any other mathematical and/or statistical operation) on the relevant records.
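
A minimal sketch in Python of the conventional per-section flow of steps (i) through (iv) described above, in which every record crosses the memory-to-processor link before it is filtered; the age/salary fields, the predicate, and the summation are illustrative examples, not details from the disclosure:

    # Illustrative only: the conventional flow in which every record is sent from
    # the memory unit to an external processor, which then filters and aggregates.

    def process_section(section):
        bytes_moved = 0
        relevant = []
        for record in section:                     # (i) select the section's records
            bytes_moved += len(str(record))        # (ii) send each record to the processor
            if record["age"] < 30:                 # (iii) filter on the processor
                relevant.append(record)
        total = sum(r["salary"] for r in relevant) # (iv) additional operation (sum)
        return total, bytes_moved

    section = [{"age": a, "salary": 1000 + a} for a in range(20, 40)]
    print(process_section(section))                # all 20 records moved, 10 relevant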

The filtering process ends only after all the records have been sent to the processor and the processor has determined which records are relevant.

In cases where the relevant entries of a database section are not stored in the processor, the relevant records must be sent to the processor after the filtering stage for further processing (applying the operations that follow the filtering).

When multiple processing operations follow a single filtering, the result of each operation may be sent to the memory unit and then sent to the processor once again.

This process is bandwidth-consuming and time-consuming.

There is a growing need to provide efficient ways of performing database processing.

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space of much lower dimension (www.wikipedia.org).

Methods for generating this mapping include neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, explainable knowledge-base methods, and explicit representations in terms of the contexts in which words appear.

Word and phrase embeddings, when used as the underlying input representation, have been shown to improve performance on NLP tasks such as syntactic parsing and sentiment analysis.

A sentence can be segmented into words or phrases, and each segment can be represented by a vector. A sentence can then be represented by a matrix that includes all of the vectors representing the words or phrases of the sentence.

The vocabulary that maps words to vectors can be stored in a memory unit, such as a dynamic random access memory (DRAM), that can be accessed using words or phrases (or indices representing the words).
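
A minimal sketch in Python of such a vocabulary lookup: an embedding table held in memory is indexed by word identifiers, and gathering one sentence touches rows scattered across the table, which is the random-access pattern discussed below. The table size, dimension, and word identifiers are made-up values:

    # Illustrative only: a word-embedding table held in memory, indexed by word id.
    # Gathering a sentence reads rows scattered across the table (random access).

    import random

    VOCAB, DIM = 10_000, 8
    random.seed(0)
    table = [[random.random() for _ in range(DIM)] for _ in range(VOCAB)]  # word id -> vector

    sentence_ids = [17, 4821, 933, 7755, 12]             # word ids for one segmented sentence
    sentence_matrix = [table[i] for i in sentence_ids]   # one scattered row read per word
    print(len(sentence_matrix), "x", DIM)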

These accesses can be random accesses, which reduces the throughput of the DRAM. Furthermore, these accesses can saturate the DRAM, especially when a large number of accesses is fed to the DRAM.

In particular, the words included in a sentence are usually fairly random. Even when DRAM bursts are used, accessing a DRAM memory that stores the mapping will typically suffer the lower performance of random access, because during a burst, only one of a small fraction of the DRAM memory bank entries (among the multiple entries of the different memory banks accessed at the same time) will usually store an entry related to a given sentence.

As a result, the throughput of the DRAM memory is low and discontinuous.

Each word or phrase of a sentence is retrieved from the DRAM memory under the control of a host computer that is external to the integrated circuit of the DRAM memory and that must control each retrieval of each vector representing each word or segment based on knowledge of the location of the word, which is a time-consuming and resource-consuming task.

Data centers and other computerized systems are expected to process and exchange increasing amounts of information at extremely high rates.

The exchange of increasing amounts of data can be a bottleneck for data centers and other computerized systems, and can cause such data centers and other computerized systems to utilize only a fraction of their capacity.

FIG. 96A illustrates an example of a prior-art database 12010 and a prior-art server motherboard 12011. The database may include multiple servers, with each server including multiple server motherboards (also denoted "CPU + memory + network"). Each server motherboard 12011 includes a CPU 12012 (such as, but not limited to, Intel's XEON) that receives traffic and is connected to a memory unit 12013 (denoted RAM) and to multiple database accelerators (DB accelerators) 12014.

The DB accelerators are optional, and the DB acceleration operations may be performed by the CPU 12012.

All traffic flows through the CPU, and the CPU may be coupled to the DB accelerators via links with relatively limited bandwidth (such as PCIe).

A large amount of resources is dedicated to routing information units between the multiple server motherboards.

There is a growing need to provide efficient data centers and other computerized systems.

The size of artificial intelligence (AI) applications, such as neural networks, has increased significantly. To cope with the increased size of neural networks, multiple servers, each serving as an AI acceleration server (and including a server motherboard), are used to execute neural network processing tasks such as, but not limited to, training. An example of a system that includes multiple AI acceleration servers arranged in different racks is shown in FIG. 1.

In a typical training session, large numbers of images are processed simultaneously to provide large numbers of values, such as losses. These values are conveyed between the different AI acceleration servers and result in an exceptional amount of traffic. For example, some neural network layers may be computed across multiple GPUs located in different AI acceleration servers, and bandwidth-consuming aggregation over the network may be required.

Conveying an exceptional amount of traffic requires ultra-high bandwidth, which may not be feasible or may not be cost-effective.

FIG. 97A illustrates a system 12050 that includes subsystems, each of which includes a switch 12051 for connecting AI acceleration servers 12052. Each AI acceleration server has a server motherboard 12055 that includes RAM memory (RAM 12056), a central processing unit (CPU) 12054, and a network adapter (NIC) 12053, and the CPU 12054 is connected (via a PCIe bus) to multiple AI accelerators 12057 (such as graphics processing units, AI chips (AI ASICs), FPGAs, and the like). The NICs are coupled to one another over a network (using, for example, Ethernet, UDP links, and the like), for example through one or more switches, and these NICs may be able to convey the ultra-high bandwidth required by the system.

There is a growing need to provide efficient AI computing systems.

According to other embodiments of the present disclosure, a non-transitory computer-readable storage medium may store program instructions that are executed by at least one processing device and that perform any of the methods described herein.

The foregoing general description and the following detailed description are merely exemplary and explanatory, and they do not limit the claims.

DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various disclosed embodiments. In the drawings:

FIG. 1 is a schematic diagram of a central processing unit (CPU).

FIG. 2 is a schematic diagram of a graphics processing unit (GPU).

FIG. 3A is a schematic diagram of an embodiment of an exemplary hardware chip consistent with the disclosed embodiments.

FIG. 3B is a schematic diagram of another embodiment of an exemplary hardware chip consistent with the disclosed embodiments.

FIG. 4 is a schematic diagram of a generic command executed by an exemplary hardware chip consistent with the disclosed embodiments.

FIG. 5 is a schematic diagram of a specialized command executed by an exemplary hardware chip consistent with the disclosed embodiments.

FIG. 6 is a schematic diagram of a processing group for use in an exemplary hardware chip consistent with the disclosed embodiments.

FIG. 7A is a schematic diagram of a rectangular array of processing groups consistent with the disclosed embodiments.

FIG. 7B is a schematic diagram of an elliptical array of processing groups consistent with the disclosed embodiments.

FIG. 7C is a schematic diagram of an array of hardware chips consistent with the disclosed embodiments.

FIG. 7D is a schematic diagram of another array of hardware chips consistent with the disclosed embodiments.

FIG. 8 is a flowchart depicting an exemplary method for compiling a series of instructions for execution on an exemplary hardware chip consistent with the disclosed embodiments.

FIG. 9 is a schematic diagram of a memory bank.

FIG. 10 is a schematic diagram of a memory bank.

FIG. 11 is a schematic diagram of an embodiment of an exemplary memory bank with sub-bank controls consistent with the disclosed embodiments.

FIG. 12 is a schematic diagram of another embodiment of an exemplary memory bank with sub-bank controls consistent with the disclosed embodiments.

FIG. 13 is a functional block diagram of an exemplary memory chip consistent with the disclosed embodiments.

FIG. 14 is a functional block diagram of an exemplary set of redundant logic blocks consistent with the disclosed embodiments.

FIG. 15 is a functional block diagram of an exemplary logic block consistent with the disclosed embodiments.

FIG. 16 is a functional block diagram of exemplary logic blocks connected to a bus consistent with the disclosed embodiments.

FIG. 17 is a functional block diagram of exemplary logic blocks connected in series consistent with the disclosed embodiments.

FIG. 18 is a functional block diagram of exemplary logic blocks connected in a two-dimensional array consistent with the disclosed embodiments.

FIG. 19 is a functional block diagram of exemplary logic blocks in a complex connection consistent with the disclosed embodiments.

FIG. 20 is an exemplary flowchart illustrating a redundant block enabling process consistent with the disclosed embodiments.

FIG. 21 is an exemplary flowchart illustrating an address assignment process consistent with the disclosed embodiments.

FIG. 22 is a functional block diagram of an exemplary processing device consistent with the disclosed embodiments.

FIG. 23 is a functional block diagram of an exemplary processing device consistent with the disclosed embodiments.

FIG. 24 includes exemplary memory configuration diagrams consistent with the disclosed embodiments.

FIG. 25 is an exemplary flowchart illustrating a memory configuration process consistent with the disclosed embodiments.

FIG. 26 is an exemplary flowchart illustrating a memory read process consistent with the disclosed embodiments.

FIG. 27 is an exemplary flowchart illustrating execution of a process consistent with the disclosed embodiments.

FIG. 28 is an embodiment of a memory chip with a refresh controller consistent with the present disclosure.

FIG. 29A is a refresh controller consistent with an embodiment of the present disclosure.

FIG. 29B is a refresh controller consistent with another embodiment of the present disclosure.

FIG. 30 is a flowchart of an embodiment of a process executed by a refresh controller consistent with the present disclosure.

FIG. 31 is a flowchart of an embodiment of a process implemented by a compiler consistent with the present disclosure.

FIG. 32 is a flowchart of another embodiment of a process implemented by a compiler consistent with the present disclosure.

FIG. 33 shows an example refresh controller configured with a stored pattern consistent with the present disclosure.

FIG. 34 is an example flowchart of a process implemented by software within a refresh controller consistent with the present disclosure.

FIG. 35A shows an example wafer including dies consistent with the present disclosure.

FIG. 35B shows an example memory chip connected to an input/output bus consistent with the present disclosure.

FIG. 35C shows an example wafer including memory chips arranged in rows and connected to an input/output bus consistent with the present disclosure.

FIG. 35D shows two memory chips forming a group and connected to an input/output bus consistent with the present disclosure.

FIG. 35E shows an example wafer consistent with the present disclosure that includes dies placed in a hexagonal lattice and connected to input/output buses.

FIGS. 36A-36D show various possible configurations of memory chips connected to input/output buses consistent with the present disclosure.

FIG. 37 shows an example grouping of dies sharing glue logic consistent with the present disclosure.

FIGS. 38A-38B show example cuts through a wafer consistent with the present disclosure.

FIG. 38C shows an example arrangement of dies on a wafer and an arrangement of input/output buses consistent with the present disclosure.

FIG. 39 shows an example memory chip on a wafer with interconnected processor subunits consistent with the present disclosure.

FIG. 40 is an example flowchart of a process for laying out groups of memory chips from a wafer consistent with the present disclosure.

FIG. 41A is another example flowchart of a process for laying out groups of memory chips from a wafer consistent with the present disclosure.

FIGS. 41B-41C are example flowcharts of a process for determining a dicing pattern for dicing one or more groups of memory chips from a wafer consistent with the present disclosure.

FIG. 42 is an example of circuitry within a memory chip that provides dual-port access along columns consistent with the present disclosure.

FIG. 43 is an example of circuitry within a memory chip that provides dual-port access along rows consistent with the present disclosure.

FIG. 44 is an example of circuitry within a memory chip that provides dual-port access along both rows and columns consistent with the present disclosure.

FIG. 45A illustrates a dual read using a duplicated memory array or pad.

FIG. 45B illustrates a dual write using a duplicated memory array or pad.

FIG. 46 is an example of circuitry within a memory chip having switching elements for dual-port access along columns consistent with the present disclosure.

FIG. 47A is an example flowchart of a process for providing dual-port access on a single-port memory array or pad consistent with the present disclosure.

FIG. 47B is an example flowchart of another process for providing dual-port access on a single-port memory array or pad consistent with the present disclosure.

FIG. 48 is another example of circuitry within a memory chip that provides dual-port access along both rows and columns consistent with the present disclosure.

FIG. 49 is an example of switching elements for dual-port access within a memory pad consistent with the present disclosure.

FIG. 50 is an example integrated circuit having a reduction unit configured to access partial words consistent with the present disclosure.

FIG. 51 is a memory bank for use with a reduction unit as described with respect to FIG. 50.

FIG. 52 is a memory bank using a reduction unit integrated into PIM logic consistent with the present disclosure.

FIG. 53 is a memory bank that uses PIM logic to activate switches for accessing partial words consistent with the present disclosure.

FIG. 54A is a memory bank with segmented column multiplexers that can be deactivated in order to access partial words consistent with the present disclosure.

FIG. 54B is an example flowchart of a process for partial-word access in a memory consistent with the present disclosure.

FIG. 55 is a conventional memory chip including multiple memory pads.

FIG. 56 is an embodiment of a memory chip having activation circuitry for reducing power consumption during line disconnection consistent with the present disclosure.

FIG. 57 is another embodiment of a memory chip having activation circuitry for reducing power consumption during line disconnection consistent with the present disclosure.

FIG. 58 is yet another embodiment of a memory chip having activation circuitry for reducing power consumption during line disconnection consistent with the present disclosure.

FIG. 59 is a further embodiment of a memory chip having activation circuitry for reducing power consumption during line disconnection consistent with the present disclosure.

FIG. 60 is an embodiment of a memory chip having global word lines and local word lines for reducing power consumption during line disconnection consistent with the present disclosure.

FIG. 61 is another embodiment of a memory chip having global word lines and local word lines for reducing power consumption during line disconnection consistent with the present disclosure.

FIG. 62 is a flowchart of a process for sequentially disconnecting lines in a memory consistent with the present disclosure.

FIG. 63 is a conventional tester for memory chips.

FIG. 64 is another conventional tester for memory chips.

FIG. 65 is an embodiment of testing a memory chip using logic units on the same substrate as the memory consistent with the present disclosure.

FIG. 66 is another embodiment of testing a memory chip using logic units on the same substrate as the memory consistent with the present disclosure.

FIG. 67 is yet another embodiment of testing a memory chip using logic units on the same substrate as the memory consistent with the present disclosure.

FIG. 68 is a further embodiment of testing a memory chip using logic units on the same substrate as the memory consistent with the present disclosure.

FIG. 69 is another embodiment of testing a memory chip using logic units on the same substrate as the memory consistent with the present disclosure.

FIG. 70 is a flowchart of a process for testing a memory chip consistent with the present disclosure.

FIG. 71 is a flowchart of another process for testing a memory chip consistent with the present disclosure.

FIG. 72A is a diagrammatic representation of an integrated circuit including a memory array and a processing array consistent with embodiments of the present invention.

FIG. 72B is a diagrammatic representation of a memory region inside an integrated circuit consistent with embodiments of the present invention.

FIG. 73A is a diagrammatic representation of an integrated circuit with an example configuration of a controller consistent with embodiments of the present invention.

FIG. 73B is a diagrammatic representation of a configuration for concurrently executing duplicated models consistent with embodiments of the present invention.

FIG. 74A is a diagrammatic representation of an integrated circuit with another example configuration of a controller consistent with embodiments of the present invention.

FIG. 74B is a flowchart representation of a method of protecting an integrated circuit consistent with exemplary disclosed embodiments.

FIG. 74C is a diagrammatic representation of detection elements located at various points within a chip consistent with exemplary disclosed embodiments.

FIG. 75A is a diagrammatic representation of a scalable processor memory system including multiple distributed processor memory chips consistent with embodiments of the present invention.

FIG. 75B is a diagrammatic representation of a scalable processor memory system including multiple distributed processor memory chips consistent with embodiments of the present invention.

FIG. 75C is a diagrammatic representation of a scalable processor memory system including multiple distributed processor memory chips consistent with embodiments of the present invention.

FIG. 75D is a diagrammatic representation of a dual-port distributed processor memory chip consistent with embodiments of the present invention.

FIG. 75E is an example timing diagram consistent with embodiments of the present invention.

FIG. 76 is a diagrammatic representation of processor memory chips having integrated controllers and interface modules and forming a scalable processor memory system consistent with embodiments of the present invention.

FIG. 77 is a flowchart for transferring data between processor memory chips in the scalable processor memory system shown in FIG. 75A consistent with embodiments of the present invention.

FIG. 78A illustrates a system for detecting, at the chip level, zero values stored in one or more particular addresses of multiple memory banks implemented in a memory chip, consistent with embodiments of the present invention.

FIG. 78B illustrates a memory chip for detecting, at the memory bank level, zero values stored in one or more particular addresses of multiple memory banks, consistent with embodiments of the present invention.

FIG. 79 illustrates a memory bank for detecting, at the memory pad level, zero values stored in one or more particular addresses of multiple memory pads, consistent with embodiments of the present invention.

FIG. 80 is a flowchart illustrating an exemplary method of detecting zero values in particular addresses of multiple discrete memory banks, consistent with embodiments of the present invention.

FIG. 81A illustrates a system for activating a next row associated with a memory bank based on next-row prediction, consistent with embodiments of the present invention.

FIG. 81B illustrates another embodiment of the system of FIG. 81A consistent with embodiments of the present invention.

FIG. 81C illustrates first and second sub-bank row controllers for each memory sub-bank consistent with embodiments of the present invention.

FIG. 81D illustrates an embodiment of next-row prediction consistent with embodiments of the present invention.

FIG. 81E illustrates an embodiment of a memory bank consistent with embodiments of the present invention.

FIG. 81F illustrates another embodiment of a memory bank consistent with embodiments of the present invention.

FIG. 82 illustrates a dual-control memory bank for reducing memory-row activation penalties consistent with embodiments of the present invention.

FIG. 83A illustrates a first example of accessing and activating rows of a memory bank.

FIG. 83B illustrates a second example of accessing and activating rows of a memory bank.

FIG. 83C illustrates a third example of accessing and activating rows of a memory bank.

图84提供传统CPU/寄存器文件及外部存储器架构的图解表示。Figure 84 provides a diagrammatic representation of a conventional CPU/register file and external memory architecture.

图85A说明符合一个实施例的具有充当寄存器文件的存储器垫的例示性分布式处理器存储器芯片。85A illustrates an exemplary distributed processor memory chip with memory pads serving as register files, in accordance with one embodiment.

图85B说明符合另一实施例的具有被配置为充当寄存器文件的存储器垫的例示性分布式处理器存储器芯片。85B illustrates an exemplary distributed processor memory chip with memory pads configured to function as register files, in accordance with another embodiment.

图85C说明符合另一实施例的具有充当寄存器文件的存储器垫的例示性装置。85C illustrates an exemplary device having a memory pad serving as a register file in accordance with another embodiment.

图86提供表示符合所公开实施例的用于在分布式处理器存储器芯片中执行至少一个指令的例示性方法的流程图。86 provides a flowchart representing an exemplary method for executing at least one instruction in a distributed processor memory chip consistent with the disclosed embodiments.

图87A包括分解式服务器的实例;Figure 87A includes an example of a decomposed server;

图87B为分布式处理的实例;Figure 87B is an example of distributed processing;

图87C为存储器/处理单元的实例;Figure 87C is an example of a memory/processing unit;

图87D为存储器/处理单元的实例;Figure 87D is an example of a memory/processing unit;

图87E为存储器/处理单元的实例;Figure 87E is an example of a memory/processing unit;

图87F为包括存储器/处理单元及一个或多个通信模块的集成电路的实例;87F is an example of an integrated circuit including a memory/processing unit and one or more communication modules;

图87G为包括存储器/处理单元及一个或多个通信模块的集成电路的实例;87G is an example of an integrated circuit including a memory/processing unit and one or more communication modules;

图87H为方法的实例;Figure 87H is an example of a method;

图87I为方法的实例;Figure 87I is an example of a method;

图88A为方法的实例;Figure 88A is an example of a method;

图88B为方法的实例;Figure 88B is an example of a method;

图88C为方法的实例;Figure 88C is an example of a method;

图89A为存储器/处理单元及词汇表的实例;Figure 89A is an example of a memory/processing unit and vocabulary;

图89B为存储器/处理单元的实例;Figure 89B is an example of a memory/processing unit;

图89C为存储器/处理单元的实例;Figure 89C is an example of a memory/processing unit;

图89D为存储器/处理单元的实例;Figure 89D is an example of a memory/processing unit;

图89E为存储器/处理单元的实例;Figure 89E is an example of a memory/processing unit;

图89F为存储器/处理单元的实例;Figure 89F is an example of a memory/processing unit;

图89G为存储器/处理单元的实例;Figure 89G is an example of a memory/processing unit;

图89H为存储器/处理单元的实例;Figure 89H is an example of a memory/processing unit;

图90A为系统的实例;Figure 90A is an example of a system;

图90B为系统的实例;Figure 90B is an example of a system;

图90C为系统的实例;Figure 90C is an example of a system;

图90D为系统的实例;Figure 90D is an example of a system;

图90E为系统的实例;Figure 90E is an example of a system;

图90F为方法的实例;Figure 90F is an example of a method;

图91A为存储器及筛选系统、储存装置以及CPU的实例;Figure 91A is an example of a memory and screening system, storage, and CPU;

图91B为存储器及处理系统、储存装置以及CPU的实例;91B is an example of a memory and processing system, storage device, and CPU;

图92A为存储器及处理系统、储存装置以及CPU的实例;92A is an example of a memory and processing system, storage device, and CPU;

图92B为存储器/处理单元的实例;Figure 92B is an example of a memory/processing unit;

图92C为存储器及筛选系统、储存装置以及CPU的实例;Figure 92C is an example of a memory and screening system, storage device, and CPU;

图92D为存储器及处理系统、储存装置以及CPU的实例;92D is an example of a memory and processing system, storage device, and CPU;

图92E为存储器及处理系统、储存装置以及CPU的实例;Figure 92E is an example of a memory and processing system, storage device, and CPU;

图92F为方法的实例;Figure 92F is an example of a method;

图92G为方法的实例;Figure 92G is an example of a method;

图92H为方法的实例;Figure 92H is an example of a method;

图92I为方法的实例;Figure 92I is an example of a method;

图92J为方法的实例;Figure 92J is an example of a method;

图92K为方法的实例;Figure 92K is an example of a method;

图93A为混合集成电路的实例的横截面图;93A is a cross-sectional view of an example of a hybrid integrated circuit;

图93B为混合集成电路的实例的横截面图;93B is a cross-sectional view of an example of a hybrid integrated circuit;

图93C为混合集成电路的实例的横截面图;93C is a cross-sectional view of an example of a hybrid integrated circuit;

图93D为混合集成电路的实例的横截面图;93D is a cross-sectional view of an example of a hybrid integrated circuit;

图93E为混合集成电路的实例的俯视图;93E is a top view of an example of a hybrid integrated circuit;

图93F为混合集成电路的实例的俯视图;93F is a top view of an example of a hybrid integrated circuit;

图93G为混合集成电路的实例的俯视图;93G is a top view of an example of a hybrid integrated circuit;

图93H为混合集成电路的实例的横截面图;93H is a cross-sectional view of an example of a hybrid integrated circuit;

图93I为混合集成电路的实例的横截面图;93I is a cross-sectional view of an example of a hybrid integrated circuit;

图93J为方法的实例;Figure 93J is an example of a method;

图94A为储存系统、一个或多个装置及运算系统的实例;94A is an example of a storage system, one or more devices, and a computing system;

图94B为储存系统、一个或多个装置及运算系统的实例;94B is an example of a storage system, one or more devices, and a computing system;

图94C为一个或多个装置及运算系统的实例;Figure 94C is an example of one or more devices and computing systems;

图94D为一个或多个装置及运算系统的实例;Figure 94D is an example of one or more devices and computing systems;

图94E为数据库加速集成电路的实例;Figure 94E is an example of a database acceleration integrated circuit;

图94F为数据库加速集成电路的实例;Figure 94F is an example of a database acceleration integrated circuit;

图94G为数据库加速集成电路的实例;Figure 94G is an example of a database acceleration integrated circuit;

图94H为数据库加速单元的实例;Figure 94H is an example of a database acceleration unit;

图94I为刀片以及数据库加速集成电路的群组的实例;94I is an example of a group of blade and database acceleration integrated circuits;

图94J为数据库加速集成电路的群组的实例;94J is an example of a group of database acceleration integrated circuits;

图94K为数据库加速集成电路的群组的实例;94K is an example of a group of database acceleration integrated circuits;

图94L为数据库加速集成电路的群组的实例;94L is an example of a group of database acceleration integrated circuits;

图94M为数据库加速集成电路的群组的实例;94M is an example of a group of database acceleration integrated circuits;

图94N为系统的实例;Figure 94N is an example of a system;

图94O为系统的实例;Figure 94O is an example of a system;

图94P为方法的实例;Figure 94P is an example of a method;

图95A为方法的实例;Figure 95A is an example of a method;

图95B为方法的实例;Figure 95B is an example of a method;

图95C为方法的实例;Figure 95C is an example of a method;

图96A为现有技术系统的实例;Figure 96A is an example of a prior art system;

图96B为系统的实例;Figure 96B is an example of a system;

图96C为数据库加速器板的实例;Figure 96C is an example of a database accelerator board;

图96D为系统的一部分的实例;Figure 96D is an example of a portion of a system;

图97A为现有技术系统的实例;Figure 97A is an example of a prior art system;

图97B为系统的实例;及Figure 97B is an example of a system; and

图97C为AI网络适配器的实例。Figure 97C is an example of an AI network adapter.

具体实施方式Detailed Description

以下详细描述参考随附图式。在任何方便之处,在图式及以下描述中使用相同参考编号来指代相同或类似部分。虽然本文中描述了若干说明性实施例,但修改、调适及其他实施为可能的。例如,可对图式中所说明的组件进行替代、添加或修改,且可通过替代、重排序、移除步骤或添加步骤至所公开方法来修改本文中所描述的说明性方法。因此,以下详细描述不限于所公开实施例及示例。反而,适当范畴由所附权利要求界定。The following detailed description refers to the accompanying drawings. Wherever convenient, the same reference numbers are used in the drawings and the following description to refer to the same or like parts. While several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, components illustrated in the figures may be replaced, added, or modified, and the illustrative methods described herein may be modified by substituting, reordering, removing steps, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Rather, the proper scope is defined by the appended claims.

处理器架构processor architecture

如贯穿本公开所使用,术语「硬件芯片」指半导体晶圆(诸如,硅或其类似物),其上形成有一个或多个电路元件(诸如,晶体管、电容器、电阻器和/或其类似物)。该电路元件可形成处理元件或存储器元件。「处理元件」指代共同执行至少一个逻辑功能(诸如,算术功能、逻辑门、其他布尔运算(Boolean operations)或其类似物)的一个或多个电路元件。处理元件可为通用处理元件(诸如,可配置的多个晶体管)或专用处理元件(诸如,经设计以执行特定逻辑功能的特定逻辑门或多个电路元件)。「存储器元件」指可用以储存数据的一个或多个电路元件。「存储器元件」也可被称作「存储器胞元」。存储器元件可为动态(使得需要电刷新以维持数据储存)、静态(使得数据在失去电力之后持续存在至少一段时间)或非易失性的存储器。As used throughout this disclosure, the term "hardware chip" refers to a semiconductor wafer (such as silicon or the like) on which one or more circuit elements (such as transistors, capacitors, resistors, and/or the like) are formed. The circuit elements may form processing elements or memory elements. A "processing element" refers to one or more circuit elements that collectively perform at least one logical function, such as an arithmetic function, a logic gate, another Boolean operation, or the like. A processing element may be a general-purpose processing element (such as a configurable plurality of transistors) or a special-purpose processing element (such as a particular logic gate or a plurality of circuit elements designed to perform a particular logic function). A "memory element" refers to one or more circuit elements that can be used to store data. A "memory element" may also be referred to as a "memory cell." The memory elements may be dynamic (such that electrical refresh is required to maintain data storage), static (such that data persists for at least a period of time after a loss of power), or non-volatile.

处理元件可接合以形成处理器子单元。「处理器子单元」因此可包含可执行至少一个任务或指令(例如,属于一处理器指令集)的处理元件的最小分组。例如,一子单元可包含被配置为共同执行指令的一个或多个通用处理元件、与经配置成以互补方式执行指令的一个或多个专用处理元件配对的一个或多个通用处理元件,或其类似物。该处理器子单元可以以阵列布置在一基板(例如,一晶圆)上。尽管「阵列」可包含矩形形状,但阵列中的子单元的任何布置可形成于基板上。The processing elements can be joined to form a processor subunit. A "processor sub-unit" may thus comprise the smallest grouping of processing elements that can execute at least one task or instruction (eg, belonging to a processor instruction set). For example, a subunit may include one or more general-purpose processing elements configured to collectively execute instructions, one or more general-purpose processing elements paired with one or more special-purpose processing elements configured to execute instructions in a complementary manner, or its analogs. The processor subunits may be arranged in an array on a substrate (eg, a wafer). Although an "array" may comprise a rectangular shape, any arrangement of subunits in an array may be formed on a substrate.

存储器元件可接合以形成存储器组(memory bank)。例如,存储器组可包含沿着至少一条导线(或其他导电连接件)链接的存储器元件的一个或多个线。此外,存储器元件可在另一方向上沿着至少一条添加导线链接。例如,存储器元件可沿着字线及比特线布置,如下文所解释。尽管存储器组可包含线,但组中的元件的任何布置可用以在基板上形成组。此外,一个或多个组可电接合至至少一个存储器控制器以形成存储器阵列。尽管存储器阵列可包含组的矩形布置,但阵列中的组的任何布置可形成于基板上。Memory elements can be bonded to form memory banks. For example, a memory bank may include one or more lines of memory elements linked along at least one wire (or other conductive connection). Additionally, the memory elements may be linked in the other direction along at least one additional wire. For example, memory elements may be arranged along word lines and bit lines, as explained below. Although a memory bank may contain lines, any arrangement of elements in a bank may be used to form a bank on a substrate. Additionally, one or more groups may be electrically coupled to at least one memory controller to form a memory array. Although a memory array may contain a rectangular arrangement of banks, any arrangement of banks in an array may be formed on a substrate.

如贯穿本公开进一步所使用,「总线」指基板的组件之间的任何通信连接件。例如,导线或线(形成电连接件)、光纤(形成光学连接件)或进行元件之间的通信的任何其他连接件可被称作「总线」。As used further throughout this disclosure, a "bus" refers to any communication connection between components of a substrate. For example, wires or lines (forming electrical connections), optical fibers (forming optical connections), or any other connections that carry communication between elements may be referred to as "buses."

常规处理器使通用逻辑电路与共享存储器配对。共享存储器可储存用于由逻辑电路执行的指令集以及用于指令集的执行且由指令集的执行产生的数据两者。如下文所描述,一些常规处理器使用高速缓存系统来缩减执行自共享存储器提取时的延迟;然而,常规高速缓存系统保持共享。常规处理器包括中央处理单元(CPU)、图形处理单元(GPU)、各种特殊应用集成电路(ASIC)或其类似物。图1展示CPU的示例,且图2展示GPU的示例。Conventional processors pair general-purpose logic circuits with shared memory. Shared memory may store both a set of instructions for execution by logic circuits and data used for execution of the set of instructions and resulting from execution of the set of instructions. As described below, some conventional processors use cache systems to reduce latency when performing fetches from shared memory; however, conventional cache systems remain shared. Conventional processors include central processing units (CPUs), graphics processing units (GPUs), various application-specific integrated circuits (ASICs), or the like. Figure 1 shows an example of a CPU, and Figure 2 shows an example of a GPU.

如图1所示,CPU 100可包含处理单元110,处理单元110可包括一个或多个处理器子单元,诸如处理器子单元120a及处理器子单元120b。尽管图1中未描绘,但每一处理器子单元可包含多个处理元件。此外,处理单元110可包括一个或多个层级的片上高速缓存。此类高速缓存元件通常与处理单元110形成于相同半导体晶粒上,而非经由形成于基板中的一个或多个总线连接至处理器子单元120a及120b,该基板含有处理器子单元120a及120b以及高速缓存元件。对于常规处理器中的第一阶(L1)及第二阶(L2)高速缓存,直接在相同晶粒上而非经由总线连接的布置为常用的。替代地,在早期处理器中,L2高速缓存系使用子单元与L2高速缓存之间的背侧总线而在处理器子单元当中共享。背侧总线通常大于下文所描述的前侧总线。因此,因为高速缓存要供晶粒上的所有处理器子单元共享,所以高速缓存130可与处理器子单元120a及120b在相同晶粒上形成或经由一个或多个背侧总线以通信方式耦接至处理器子单元120a及120b。在不具有总线(例如,高速缓存直接形成于晶粒上)的实施例以及使用背侧总线的实施例两者中,高速缓存在CPU的处理器子单元之间共享。As shown in FIG. 1, CPU 100 may include a processing unit 110, which may include one or more processor subunits, such as processor subunit 120a and processor subunit 120b. Although not depicted in FIG. 1, each processor subunit may include multiple processing elements. Additionally, processing unit 110 may include one or more levels of on-chip cache. Such cache elements are typically formed on the same semiconductor die as processing unit 110, rather than being connected to processor subunits 120a and 120b via one or more buses formed in a substrate that contains processor subunits 120a and 120b as well as the cache elements. For first-level (L1) and second-level (L2) caches in conventional processors, an arrangement directly on the same die, rather than connected via a bus, is common. Alternatively, in earlier processors, the L2 cache was shared among the processor subunits using a backside bus between the subunits and the L2 cache. The backside bus is generally larger than the front-side bus described below. Thus, because the cache is to be shared by all processor subunits on the die, cache 130 may be formed on the same die as processor subunits 120a and 120b or communicatively coupled to processor subunits 120a and 120b via one or more backside buses. In both embodiments without a bus (e.g., where the cache is formed directly on the die) and embodiments using a backside bus, the cache is shared among the processor subunits of the CPU.

此外,处理单元110与共享存储器140a及存储器140b通信。例如,存储器140a及140b可表示共享动态随机存取存储器(DRAM)的存储器组。尽管描绘为具有两个存储器组,但大部分常规存储器芯片包括介于八个与十六个之间的存储器组。因此,处理器子单元120a及120b可使用共享存储器140a及140b储存数据,该数据接着由处理器子单元120a及120b进行操作。然而,此布置导致存储器140a及140b与处理单元110之间的总线在处理单元110的时钟速度超过总线的数据传送速度时成为瓶颈。对于常规处理器,通常系如此情况,从而导致低于基于时钟速率及晶体管数量的规定处理速度的有效处理速度。Additionally, the processing unit 110 is in communication with shared memory 140a and memory 140b. For example, memories 140a and 140b may represent memory banks that share dynamic random access memory (DRAM). Although depicted as having two memory banks, most conventional memory chips include between eight and sixteen memory banks. Thus, the processor sub-units 120a and 120b may use the shared memory 140a and 140b to store data, which is then manipulated by the processor sub-units 120a and 120b. However, this arrangement causes the bus between the memories 140a and 140b and the processing unit 110 to become a bottleneck when the clock speed of the processing unit 110 exceeds the data transfer speed of the bus. For conventional processors, this is often the case, resulting in an effective processing speed that is lower than the specified processing speed based on clock rate and transistor count.
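A rough, back-of-the-envelope calculation can make this bottleneck concrete. The following sketch is purely illustrative; the helper name and every number in it are hypothetical and are not taken from this disclosure:

# Illustrative sketch (hypothetical numbers only): how a shared memory bus can
# cap the effective processing speed of a conventional CPU.

def effective_ops_per_second(clock_hz, ops_per_cycle, bus_bytes_per_second, bytes_per_op):
    """Return the achievable operation rate given compute and bus limits."""
    compute_limit = clock_hz * ops_per_cycle                 # what the cores could do
    bandwidth_limit = bus_bytes_per_second / bytes_per_op    # what the bus can feed
    return min(compute_limit, bandwidth_limit)

# Hypothetical example: a 3 GHz, 8-ops/cycle processor fed by a 25 GB/s bus,
# with each operation needing 8 bytes of operands from shared memory.
print(effective_ops_per_second(3e9, 8, 25e9, 8))  # ~3.1e9 ops/s, far below the 2.4e10 ops/s compute limit

Under these assumed figures, the bus rather than the clock rate or transistor count sets the achievable rate, which is the situation the passage above describes.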

如图2中所展示,GPU中也存在类似缺陷。GPU 200可包含处理单元210,处理单元210可包括一个或多个处理器子单元(例如,子单元220a、220b、220c、220d、220e、220f、220g、220h、220i、220j、220k、220l、220m、220n、220o及220p)。此外,处理单元210可包括一个或多个层级的片上高速缓存和/或寄存器文件。此类高速缓存元件通常与处理单元210形成于相同半导体晶粒上。实际上,在图2的实施例中,高速缓存210与处理单元210形成于相同晶粒上且在所有处理器子单元当中共享,而高速缓存230a、230b、230c及230d分别形成于处理器子单元的子集上且专用于该处理器子单元。Similar flaws exist in GPUs, as shown in Figure 2. GPU 200 may include a processing unit 210, which may include one or more processor subunits (eg, subunits 220a, 220b, 220c, 220d, 220e, 220f, 220g, 220h, 220i, 220j, 220k, 220l, 220m, 220n, 220o and 220p). Additionally, processing unit 210 may include one or more levels of on-chip cache and/or register files. Such cache elements are typically formed on the same semiconductor die as processing unit 210 . In fact, in the embodiment of FIG. 2, the cache 210 and the processing unit 210 are formed on the same die and are shared among all the processor sub-units, while the caches 230a, 230b, 230c and 230d are respectively formed on the processor sub-units on a subset of units and dedicated to this processor subunit.

此外,处理单元210与共享存储器250a、250b、250c及250d通信。例如,存储器250a、250b、250c及250d可表示共享DRAM的存储器组。因此,处理单元210的处理器子单元可使用共享存储器250a、250b、250c及250d储存数据,该数据接着由该处理器子单元进行操作。然而,此布置导致存储器250a、250b、250c及250d与处理单元210之间的总线成为瓶颈,其类似于上文关于CPU所描述的瓶颈。Additionally, the processing unit 210 is in communication with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent memory banks that share DRAM. Accordingly, the processor subunit of the processing unit 210 may use the shared memories 250a, 250b, 250c, and 250d to store data, which is then manipulated by the processor subunit. However, this arrangement causes the bus between the memories 250a, 250b, 250c, and 250d and the processing unit 210 to become a bottleneck, similar to that described above with respect to the CPU.

所公开硬件芯片的概述Overview of the disclosed hardware chip

图3A为描绘示例性硬件芯片300的实施例的示意图。硬件芯片300可包含经设计以缓解上文关于CPU、GPU及其他常规处理器所描述的瓶颈的分布式处理器。分布式处理器可包括在空间上分布于单一基板上的多个处理器子单元。此外,如上文所解释,在本公开的分布式处理器中,对应存储器组还在空间上分布于基板上。在一些实施例中,分布式处理器可与一组指令相关联,且分布式处理器的处理器子单元中的每一个可负责执行包括于该组指令中的一个或多个任务。FIG. 3A is a schematic diagram depicting an embodiment of an exemplary hardware chip 300 . Hardware chip 300 may include distributed processors designed to alleviate the bottlenecks described above with respect to CPUs, GPUs, and other conventional processors. A distributed processor may include multiple processor subunits that are spatially distributed on a single substrate. Furthermore, as explained above, in the distributed processor of the present disclosure, the corresponding memory banks are also spatially distributed on the substrate. In some embodiments, a distributed processor may be associated with a set of instructions, and each of the processor subunits of the distributed processor may be responsible for performing one or more tasks included in the set of instructions.

如图3A中所描绘,硬件芯片300可包含多个处理器子单元,例如,逻辑及控制子单元320a、320b、320c、320d、320e、320f、320g及320h。如图3A中进一步所描绘,每一处理器子单元可具有一专用存储器实例。例如,逻辑及控制子单元320a可操作地连接至专用存储器实例330a,逻辑及控制子单元320b可操作地连接至专用存储器实例330b,逻辑及控制子单元320c可操作地连接至专用存储器实例330c,逻辑及控制子单元320d可操作地连接至专用存储器实例330d,逻辑及控制子单元320e可操作地连接至专用存储器实例330e,逻辑及控制子单元320f可操作地连接至专用存储器实例330f,逻辑及控制子单元320g可操作地连接至专用存储器实例330g,且逻辑及控制子单元320h可操作地连接至专用存储器实例330h。As depicted in Figure 3A, hardware chip 300 may include multiple processor subunits, eg, logic and control subunits 320a, 320b, 320c, 320d, 320e, 320f, 320g, and 320h. As further depicted in Figure 3A, each processor sub-unit may have an instance of dedicated memory. For example, logic and control subunit 320a is operably connected to dedicated memory instance 330a, logic and control subunit 320b is operably connected to dedicated memory instance 330b, logic and control subunit 320c is operably connected to dedicated memory instance 330c, The logic and control subunit 320d is operably connected to the special purpose memory instance 330d, the logic and control subunit 320e is operably connected to the special purpose memory instance 330e, the logic and control subunit 320f is operably connected to the special purpose memory instance 330f, the logic and Control subunit 320g is operably connected to dedicated memory instance 330g, and logic and control subunit 320h is operably connected to dedicated memory instance 330h.

尽管图3A将每个存储器实例描绘为单一存储器组,但硬件芯片300可包括两个或多于两个存储器组作为用于硬件芯片300上的处理器子单元的专用存储器实例。此外,尽管图3A将每一处理器子单元描绘为包含逻辑组件及用于专用存储器组的控制件两者,但硬件芯片300可使用用于存储器组的控制件,该控制件至少部分地与该逻辑组件分开。此外,如图3A中所描绘,可将两个或多于两个处理器子单元及其对应存储器组分组成例如处理群组310a、310b、310c及310d。「处理群组」可表示上面形成有硬件芯片300的基板上的空间区别。因此,处理群组可包括用于群组中的存储器组的其他控制件,例如,控制件340a、340b、340c及340d。另外或替代地,「处理群组」可表示用于编译代码以供在硬件芯片300上执行的目的的逻辑分组。因此,用于硬件芯片300的编译程序(下文进一步描述)可在硬件芯片300上的处理群组之间划分整组指令。Although FIG. 3A depicts each memory instance as a single memory bank, hardware chip 300 may include two or more memory banks as the dedicated memory instances for a processor subunit on hardware chip 300. Moreover, although FIG. 3A depicts each processor subunit as including both logic components and controls for its dedicated memory bank, hardware chip 300 may use controls for the memory banks that are at least partially separate from the logic components. Furthermore, as depicted in FIG. 3A, two or more processor subunits and their corresponding memory banks may be grouped into, for example, processing groups 310a, 310b, 310c, and 310d. A "processing group" may represent a spatial distinction on the substrate on which hardware chip 300 is formed. Accordingly, a processing group may include further controls for the memory banks in the group, e.g., controls 340a, 340b, 340c, and 340d. Additionally or alternatively, a "processing group" may represent a logical grouping for purposes of compiling code for execution on hardware chip 300. Accordingly, a compiler for hardware chip 300 (described further below) may divide an overall set of instructions among the processing groups on hardware chip 300.

此外,主机350可将指令、数据及其他输入提供至硬件芯片300且自该硬件芯片读取输出。因此,一组指令可全部在单一晶粒上,例如在代管硬件芯片300的晶粒上执行。实际上,晶粒外的仅有通信可包括指令至硬件芯片300的加载、发送至硬件芯片300的任何输入及从硬件芯片300读取的任何输出。因此,所有计算及存储器操作可在晶粒上(在硬件芯片300上)执行,这是因为硬件芯片300的处理器子单元与硬件芯片300的专用存储器组通信。In addition, host 350 may provide instructions, data, and other inputs to hardware chip 300 and read outputs from the hardware chip. Thus, a set of instructions may all be executed on a single die, such as the die hosting the hardware chip 300 . In fact, the only communication outside the die may include the loading of instructions to the hardware chip 300 , any input sent to the hardware chip 300 , and any output read from the hardware chip 300 . Thus, all computation and memory operations can be performed on the die (on the hardware chip 300 ) because the processor subunit of the hardware chip 300 communicates with the dedicated memory banks of the hardware chip 300 .
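A minimal software model may help visualize this arrangement. The sketch below is illustrative only; class and method names such as ProcessorSubunit, load, and run are invented for the example and do not describe the actual hardware:

# Minimal, illustrative model (not the disclosed implementation) of a chip in
# which every processor subunit executes its own instruction stream against
# its own dedicated memory bank, so all computation stays "on die".

class ProcessorSubunit:
    def __init__(self, bank_size):
        self.memory_bank = [0] * bank_size   # dedicated memory bank
        self.program = []                    # instructions loaded by the host

    def load(self, program, data):
        self.program = program
        self.memory_bank[:len(data)] = data

    def run(self):
        for op, addr_a, addr_b, addr_out in self.program:
            if op == "add":
                self.memory_bank[addr_out] = self.memory_bank[addr_a] + self.memory_bank[addr_b]
            elif op == "mul":
                self.memory_bank[addr_out] = self.memory_bank[addr_a] * self.memory_bank[addr_b]

# A host splits a task among subunits, loads code and data, and later reads outputs.
chip = [ProcessorSubunit(bank_size=16) for _ in range(8)]
for unit in chip:
    unit.load([("mul", 0, 1, 2), ("add", 2, 2, 3)], data=[3, 5])
    unit.run()
print([unit.memory_bank[3] for unit in chip])  # all eight subunits hold 30, read back by the host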

图3B为另一示例性硬件芯片300'的实施例的示意图。尽管描绘为硬件芯片300的替代,但图3B中所描绘的架构可至少部分地与图3A中所描绘的架构组合。3B is a schematic diagram of an embodiment of another exemplary hardware chip 300'. Although depicted as an alternative to hardware chip 300, the architecture depicted in Figure 3B may be combined, at least in part, with the architecture depicted in Figure 3A.

如图3B中所描绘,硬件芯片300'可包含多个处理器子单元,例如,处理器子单元350a、350b、350c及350d。如图3B中进一步所描绘,每一处理器子单元可具有多个专用存储器实例。例如,处理器子单元350a可操作地连接至专用存储器实例330a及330b,处理器子单元350b可操作地连接至专用存储器实例330c及330d,处理器子单元350c可操作地连接至专用存储器实例330e及330f,且处理器子单元350d可操作地连接至专用存储器实例330g及330h。此外,如图3B中所描绘,可将处理器子单元及其对应存储器组分组成例如处理群组310a、310b、310c及310d。如上文所解释,「处理群组」可表示上面形成有硬件芯片300'的基板上的空间区别和/或用于编译代码以供在硬件芯片300'上执行的目的的逻辑分组。As depicted in FIG. 3B, hardware chip 300' may include multiple processor subunits, eg, processor subunits 350a, 350b, 350c, and 350d. As further depicted in Figure 3B, each processor sub-unit may have multiple dedicated memory instances. For example, processor subunit 350a is operably connected to special-purpose memory instances 330a and 330b, processor sub-unit 350b is operably connected to special-purpose memory instances 330c and 330d, and processor sub-unit 350c is operably connected to special-purpose memory instance 330e and 330f, and processor sub-unit 350d is operably connected to dedicated memory instances 330g and 330h. Furthermore, as depicted in Figure 3B, processor subunits and their corresponding memory components may be grouped into processing groups 310a, 310b, 310c, and 310d, for example. As explained above, a "processing group" may represent a spatial distinction on a substrate on which the hardware chip 300' is formed and/or a logical grouping for the purpose of compiling code for execution on the hardware chip 300'.

如图3B中进一步所描绘,处理器子单元可经由总线彼此通信。例如,如图3B所展示,处理器子单元350a可经由总线360a与处理器子单元350b通信,经由总线360c与处理器子单元350c通信,且经由总线360f与处理器子单元350d通信。类似地,处理器子单元350b可经由总线360a与处理器子单元350a通信(如上文所描述),经由总线360e与处理器子单元350c通信,且经由总线360d与处理器子单元350d通信。此外,处理器子单元350c可经由总线360c与处理器子单元350a通信(如上文所描述),经由总线360e与处理器子单元350b通信(如上文所描述),且经由总线360b与处理器子单元350d通信。因此,处理器子单元350d可经由总线360f与处理器子单元350a通信(如上文所描述),经由总线360d与处理器子单元350b通信(如上文所描述),且经由总线360b与处理器子单元350c通信(如上文所描述)。本领域技术人员将理解,可使用比图3B中所描绘的总线少的总线。例如,可消除总线360e,使得处理器子单元350b与350c之间的通信经由处理器子单元350a和/或350d传递。类似地,可消除总线360f,使得处理器子单元350a与处理器子单元350d之间的通信经由处理器子单元350b或350c传递。As further depicted in Figure 3B, the processor subunits may communicate with each other via a bus. For example, as shown in Figure 3B, processor subunit 350a may communicate with processor subunit 350b via bus 360a, with processor subunit 350c via bus 360c, and with processor subunit 350d via bus 360f. Similarly, processor subunit 350b may communicate with processor subunit 350a (as described above) via bus 360a, with processor subunit 350c via bus 360e, and with processor subunit 350d via bus 360d. Additionally, processor subunit 350c may communicate with processor subunit 350a (as described above) via bus 360c, with processor subunit 350b (as described above) via bus 360e, and with processor subunit 350b via bus 360b Unit 350d communicates. Accordingly, processor subunit 350d may communicate with processor subunit 350a (as described above) via bus 360f, with processor subunit 350b (as described above) via bus 360d, and with processor subunit 350b via bus 360b Unit 350c communicates (as described above). Those skilled in the art will understand that fewer buses than those depicted in Figure 3B may be used. For example, bus 360e may be eliminated so that communications between processor subunits 350b and 350c pass through processor subunits 350a and/or 350d. Similarly, bus 360f may be eliminated so that communications between processor subunit 350a and processor subunit 350d pass through processor subunit 350b or 350c.

此外,本领域技术人员将理解,可使用除图3A及图3B中所描绘的架构以外的架构。例如,各具有单处理器子单元及存储器实例的处理群组的阵列可布置在基板上。处理器子单元可另外或替代地形成用于对应的专用存储器组的控制器的部分、用于对应的专用存储器的存储器垫的控制器的部分,或其类似物。Furthermore, those skilled in the art will understand that architectures other than those depicted in FIGS. 3A and 3B may be used. For example, an array of processing groups, each having a single processor subunit and a memory instance, may be arranged on a substrate. A processor subunit may additionally or alternatively form part of a controller for its corresponding dedicated memory bank, part of a controller for a memory pad of its corresponding dedicated memory, or the like.

鉴于上文所描述的架构,相较于传统架构,硬件芯片300及300'可显着提高存储器密集型任务的效率。例如,数据库操作及人工智能算法(诸如,神经网络)为存储器密集型任务的示例,对于存储器密集型任务,传统架构在效率上低于硬件芯片300及300'。因此,硬件芯片300及300'可被称作数据库加速器处理器和/或人工智能加速器处理器。Given the architectures described above, hardware chips 300 and 300' can significantly improve the efficiency of memory-intensive tasks compared to conventional architectures. For example, database operations and artificial intelligence algorithms (such as neural networks) are examples of memory-intensive tasks for which conventional architectures are less efficient than hardware chips 300 and 300'. Therefore, the hardware chips 300 and 300' may be referred to as database accelerator processors and/or artificial intelligence accelerator processors.

配置所公开硬件芯片Configure the disclosed hardware chip

上文所描述的硬件芯片架构可被配置为用于代码执行。例如,每一处理器子单元可与硬件芯片中的其他处理器子单元隔开而个别地执行代码(定义一组指令)。因此,替代依赖于操作系统来管理多线程处理或使用多任务处理(其为并发的而非平行的),本公开的硬件芯片可允许处理器子单元完全平行地操作。The hardware chip architecture described above can be configured for code execution. For example, each processor subunit may execute code (defining a set of instructions) separately from other processor subunits in the hardware chip. Thus, instead of relying on the operating system to manage multi-threading or using multi-tasking (which is concurrent rather than parallel), the hardware chips of the present disclosure may allow processor sub-units to operate in full parallel.

除上文所描述的完全平行实施以外,指派给每一处理器子单元的指令中的至少一些可重叠。例如,分布式处理器上的多个处理器子单元可执行重叠指令作为例如操作系统或其他管理软件的实施,同时执行非重叠指令以便在操作系统或其他管理软件的内容背景内执行平行任务。In addition to the fully parallel implementation described above, at least some of the instructions assigned to each processor subunit may overlap. For example, multiple processor subunits on a distributed processor may execute overlapping instructions as an implementation of, for example, an operating system or other management software, while simultaneously executing non-overlapping instructions to perform parallel tasks within the context of the operating system or other management software.

图4描绘通过处理群组410进行的用于执行通用命令的示例性处理程序400。例如,处理群组410可包含本公开的硬件芯片(例如,硬件芯片300、硬件芯片300'或其类似物)的一部分。FIG. 4 depicts an exemplary processing procedure 400 by processing group 410 for executing generic commands. For example, processing group 410 may comprise a portion of a hardware chip of the present disclosure (eg, hardware chip 300, hardware chip 300', or the like).

如图4中所描绘,命令可发送至与专用存储器实例420配对的处理器子单元430。外部主机(例如,主机350)可将该命令发送至处理群组410以供执行。替代地,主机350可能已发送包括该命令的指令集以用于储存于存储器实例420中,使得处理器子单元430可从存储器实例420取回命令且执行所取回的命令。因此,该命令可由处理元件440执行,该处理元件为可配置以执行所接收的命令的通用处理元件。此外,处理群组410可包括用于存储器实例420的控制件460。如图4中所描绘,控制件460可执行处理元件440在执行所接收的命令时所需的对存储器实例420的任何读取和/或写入。在执行命令之后,处理群组410可将命令的结果输出至例如外部主机或输出至相同硬件芯片上的不同处理群组。As depicted in FIG. 4 , commands may be sent to processor sub-unit 430 paired with dedicated memory instance 420 . An external host (eg, host 350) may send the command to processing group 410 for execution. Alternatively, host 350 may have sent an instruction set including the command for storage in memory instance 420 so that processor sub-unit 430 may retrieve the command from memory instance 420 and execute the retrieved command. Accordingly, the command may be executed by processing element 440, which is a general-purpose processing element configurable to execute the received command. Additionally, processing group 410 may include controls 460 for memory instance 420 . As depicted in FIG. 4, control 460 may perform any reads and/or writes to memory instance 420 required by processing element 440 in executing the received command. After executing the command, processing group 410 may output the result of the command to, for example, an external host or to a different processing group on the same hardware chip.

在一些实施例中,如图4中所描绘,处理器子单元430还可以包括地址生成器450。「地址生成器」可包含多个处理元件,多个处理元件被配置为判定用于执行读取及写入的一个或多个存储器组中的地址,且也可对位于所判定地址处的数据执行操作(例如,加法、减法、乘法或其类似物)。例如,地址生成器450可判定用于对存储器进行的任何读取或写入的地址。在一个示例中,地址生成器450可通过在不再需要读取值时用基于命令所判定的新值覆写读取值来提高效率。另外或替代地,地址生成器450可选择可用地址以用于储存来自命令执行的结果。此可允许为后一时钟循环调度结果读出,这对于外部主机较为便利。在另一示例中,地址生成器450可在诸如向量或矩阵乘法累加(multiply-accumulate)计算的多循环计算期间判定读取及写入的地址。因此,地址生成器450可维持或计算用于读取数据及写入多循环计算的中间结果的存储器地址,使得处理器子单元430可继续处理而不必储存这些存储器地址。In some embodiments, as depicted in FIG. 4, processor subunit 430 may further include an address generator 450. An "address generator" may include a plurality of processing elements configured to determine addresses in one or more memory banks for performing reads and writes, and may also perform operations (e.g., addition, subtraction, multiplication, or the like) on data located at the determined addresses. For example, address generator 450 may determine the address for any read from or write to memory. In one example, address generator 450 may improve efficiency by overwriting a read value with a new value determined based on the command when the read value is no longer needed. Additionally or alternatively, address generator 450 may select available addresses for storing results from execution of the command. This may allow the readout of the results to be scheduled for a later clock cycle, which is convenient for an external host. In another example, address generator 450 may determine the addresses for reads and writes during a multi-cycle computation, such as a vector or matrix multiply-accumulate computation. Accordingly, address generator 450 may maintain or compute the memory addresses used for reading data and for writing intermediate results of the multi-cycle computation, so that processor subunit 430 may continue processing without having to store these memory addresses.
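As a rough illustration of the bookkeeping described above, the sketch below models an address generator that emits the read and write addresses for a multi-cycle vector multiply-accumulate; the function name and the memory layout are assumptions made only for the example:

# Illustrative sketch (not the disclosed circuit): produce the read addresses
# for each cycle of a vector MAC and reuse a single write address for the
# running intermediate result, so the processing element itself never has to
# store addresses.

def mac_address_stream(vec_a_base, vec_b_base, acc_addr, length):
    """Yield (read_addr_a, read_addr_b, write_addr) for each cycle of the MAC."""
    for i in range(length):
        yield vec_a_base + i, vec_b_base + i, acc_addr

memory = [1, 2, 3, 4,      # vector A at addresses 0..3
          10, 20, 30, 40,  # vector B at addresses 4..7
          0]               # accumulator at address 8
for a_addr, b_addr, out_addr in mac_address_stream(0, 4, 8, 4):
    memory[out_addr] += memory[a_addr] * memory[b_addr]
print(memory[8])  # 1*10 + 2*20 + 3*30 + 4*40 = 300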

图5描绘通过处理群组510进行的用于执行专门命令的示例性处理程序500。例如,处理群组510可包含本公开的硬件芯片(例如,硬件芯片300、硬件芯片300'或其类似物)的一部分。FIG. 5 depicts an exemplary processing procedure 500 by processing group 510 for executing specialized commands. For example, processing group 510 may comprise a portion of a hardware chip of the present disclosure (eg, hardware chip 300, hardware chip 300', or the like).

如图5中所描绘,专门命令(例如,乘法累加命令)可发送至与专用存储器实例520配对的处理元件530。外部主机(例如,主机350)可将该命令发送至处理元件530以供执行。因此,该命令可由处理元件530在来自主机的给定信号下执行,该处理元件为可配置以执行特定命令(包括所接收的命令)的专门处理元件。替代地,处理元件530可从存储器实例520取回命令以供执行。因此,在图5的示例中,处理元件530为乘法累加(MAC)电路,该电路被配置为执行从外部主机接收或从存储器实例520取回的MAC命令。在执行命令之后,处理群组410可将命令的结果输出至例如外部主机或输出至相同硬件芯片上的不同处理群组。尽管关于单一命令及单一结果来描绘,但可接收或取回并执行多个命令,且多个结果可在输出之前在处理群组510上组合。As depicted in FIG. 5 , specialized commands (eg, multiply-accumulate commands) may be sent to processing element 530 paired with specialized memory instance 520 . An external host (eg, host 350) may send the command to processing element 530 for execution. Thus, the command may be executed by processing element 530, which is a specialized processing element configurable to execute a particular command (including received commands), under a given signal from the host. Alternatively, processing element 530 may retrieve commands from memory instance 520 for execution. Thus, in the example of FIG. 5 , processing element 530 is a multiply-accumulate (MAC) circuit configured to execute MAC commands received from an external host or retrieved from memory instance 520 . After executing the command, processing group 410 may output the result of the command to, for example, an external host or to a different processing group on the same hardware chip. Although depicted with respect to a single command and a single result, multiple commands may be received or retrieved and executed, and multiple results may be combined on processing group 510 prior to output.

尽管在图5中描绘为MAC电路,但额外或替代的专门电路可包括于处理群组510中。例如,可实施MAX读取命令(其传回向量的最大值)、MAX0读取命令(也被称作整流器的常用功能,其传回整个向量,而且传回为0的最大值),或其类似物。Although depicted in FIG. 5 as a MAC circuit, additional or alternative specialized circuits may be included in processing group 510. For example, a MAX read command (which returns the maximum value of a vector), a MAX0 read command (a common function, also referred to as a rectifier, which returns the entire vector with a maximum taken against 0), or the like may be implemented.
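The specialized read commands mentioned above can be sketched in software as follows. The exact MAX0 semantics are paraphrased here as the common rectifier (an element-wise maximum against 0); that reading, and the function names, are assumptions made for the example:

# Illustrative software model of the specialized read commands.

def max_read(bank, start, length):
    """MAX read: return the maximum value of a stored vector."""
    return max(bank[start:start + length])

def max0_read(bank, start, length):
    """MAX0 read (rectifier reading): return the whole vector with negatives clamped to 0."""
    return [max(0, v) for v in bank[start:start + length]]

bank = [-3, 7, 0, -1, 5]
print(max_read(bank, 0, 5))   # 7
print(max0_read(bank, 0, 5))  # [0, 7, 0, 0, 5]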

尽管分开地描绘,但图4的一般处理群组410及图5的专门处理群组510可组合。例如,通用处理器子单元可耦接至一个或多个专门处理器子单元以形成处理器子单元。因此,通用处理器子单元可用于不可由一个或多个专门处理器子单元执行的所有指令。Although depicted separately, the general processing group 410 of FIG. 4 and the specialized processing group 510 of FIG. 5 may be combined. For example, a general-purpose processor sub-unit may be coupled to one or more specialized processor sub-units to form a processor sub-unit. Thus, a general-purpose processor subunit may be used for all instructions that are not executable by one or more specialized processor subunits.

本领域技术人员将理解,可通过专门逻辑电路来处置神经网络实施及其他记忆密集型任务。例如,数据库查询、封包检测、字符串比较及其他功能在由本文中所描述的硬件芯片执行的情况下可提高效率。Those skilled in the art will understand that neural network implementation and other memory-intensive tasks may be handled by specialized logic circuits. For example, database queries, packet inspection, string comparisons, and other functions may improve efficiency if performed by the hardware chips described herein.

用于分布式处理的基于存储器的架构Memory-Based Architecture for Distributed Processing

在符合本公开的硬件芯片上,专用总线可在该芯片上的处理器子单元之间和/或在该处理器子单元与其对应的专用存储器组之间传送数据。使用专用总线可降低仲裁成本,这是因为竞争请求系不可能的或容易使用软件而非使用硬件来避免。On a hardware chip consistent with the present disclosure, a dedicated bus may transfer data between processor subunits on the chip and/or between the processor subunits and their corresponding dedicated memory banks. Using a dedicated bus can reduce arbitration costs because competing requests are impossible or easy to avoid using software rather than hardware.

图6示意性地描绘处理群组600的示意图。处理群组600可供用于硬件芯片(例如,硬件芯片300、硬件芯片300'或其类似物)中。处理器子单元610可经由总线630连接至存储器620。存储器620可包含随机可存取存储器(RAM)元件,其储存用于由处理器子单元610执行的数据及代码。在一些实施例中,存储器620可为N路存储器(其中N为等于或大于1的数字,其暗示交错的存储器620中的区段的数量)。因为处理器子单元610经由总线630耦接至专用于处理器子单元610的存储器620,所以N可保持相对较小而不损害执行效能。此表示对常规多路寄存器文件或高速缓存的改良,其中较低N通常导致较低执行效能,且较高N通常导致大的面积及功率损失。FIG. 6 schematically depicts a schematic diagram of a processing group 600 . Processing group 600 may be used in a hardware chip (eg, hardware chip 300, hardware chip 300', or the like). The processor subunit 610 may be connected to the memory 620 via the bus 630 . Memory 620 may include random-accessible memory (RAM) elements that store data and code for execution by processor sub-unit 610 . In some embodiments, memory 620 may be an N-way memory (where N is a number equal to or greater than 1, which implies the number of segments in memory 620 that are interleaved). Because processor sub-unit 610 is coupled to memory 620 dedicated to processor sub-unit 610 via bus 630, N can be kept relatively small without compromising execution performance. This represents an improvement over conventional multi-way register files or caches, where lower N generally results in lower performance, and higher N generally results in large area and power penalties.

可根据例如一个或多个任务中所涉及的数据的大小而调整存储器620的大小、通路的数量及总线630的宽度以满足使用处理群组600的系统的任务及应用程序实施的要求。存储器元件620可包含此项技术中已知的一个或多个类型的存储器,例如,易失性存储器(诸如,RAM、DRAM、SRAM、相变RAM(PRAM)、磁阻式RAM(MRAM)、电阻式RAM(ReRAM)或其类似物)或非易失性存储器(诸如,快闪存储器或ROM)。根据一些实施例,存储器元件620的一部分可包含第一存储器类型,而另一部分可包含另一存储器类型。例如,存储器元件620的代码区可包含ROM元件,而存储器元件620的数据区可包含DRAM元件。此分割的另一示例为将神经网络的权重储存于快闪存储器中,而将用于计算的数据储存于DRAM中。The size of memory 620, the number of lanes, and the width of bus 630 may be adjusted to meet the requirements of the task and application implementation of the system using processing group 600, for example, depending on the size of the data involved in one or more tasks. Memory element 620 may comprise one or more types of memory known in the art, eg, volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), Resistive RAM (ReRAM or the like) or non-volatile memory such as flash memory or ROM. According to some embodiments, a portion of memory element 620 may include a first memory type, while another portion may include another memory type. For example, the code region of memory element 620 may include ROM elements, while the data region of memory element 620 may include DRAM elements. Another example of this split is to store the weights of the neural network in flash memory and the data for computation in DRAM.

处理器子单元610包含处理元件640,该处理元件可包含处理器。该处理器可为管线式或非管线式的,可为定制精简指令集计算(RISC)元件或其他处理方案,实施于此项技术中已知的任何商业集成电路(IC)(诸如,ARM、ARC、RISCV等)上,如本领域技术人员所了解。处理元件640可包含控制器,该控制器在一些实施例中包括算术逻辑单元(ALU)或其他控制器。The processor sub-unit 610 includes a processing element 640, which may include a processor. The processor may be pipelined or non-pipelined, a custom reduced instruction set computing (RISC) element or other processing scheme, implemented in any commercial integrated circuit (IC) known in the art (such as ARM, ARC, RISCV, etc.), as understood by those skilled in the art. Processing element 640 may include a controller, which in some embodiments includes an arithmetic logic unit (ALU) or other controller.

根据本公开的一些实施例,执行所接收或所储存的代码的处理元件640可包含通用处理元件,且因此为灵活的并能够执行广泛多种处理操作。当比较在特定操作的执行期间所消耗的功率时,非专用电路系统通常比特定操作专用电路系统消耗更多功率。因此,当执行特定的复杂算术计算时,处理元件640可比专用硬件消耗更多功率且执行效率更低。因此,根据一些实施例,处理元件640的控制器可经设计以执行特定操作(例如,加法或「移动」操作)。According to some embodiments of the present disclosure, the processing element 640 executing the received or stored code may comprise a general-purpose processing element, and thus be flexible and capable of performing a wide variety of processing operations. When comparing the power consumed during the performance of a particular operation, the non-dedicated circuitry typically consumes more power than the particular operation dedicated circuitry. Thus, when performing certain complex arithmetic calculations, processing element 640 may consume more power and perform less efficiently than dedicated hardware. Thus, according to some embodiments, the controller of processing element 640 may be designed to perform specific operations (eg, add or "move" operations).

在本公开的一实施例中,特定操作可由一个或多个加速器650执行。每一加速器可为专用的且经编程以执行特定计算(诸如,乘法、浮点向量运算或其类似物)。通过使用加速器,每个处理器子单元的每次计算所消耗的平均功率可降低,且计算吞吐量此后增大。可根据系统经设计以实施的应用程序(例如,执行神经网络、执行数据库查询或其类似物)而选择加速器650。加速器650可由处理元件640配置且可与处理元件串接地操作以用于降低功率消耗且加速计算及计算。加速器可另外或替代地用以在诸如智能型直接存储器存取(DMA)周边设备的处理群组600的存储器与MUX/DEMUX/输入/输出端口(例如,MUX 650及DEMUX 660)之间传送数据。In an embodiment of the present disclosure, specific operations may be performed by one or more accelerators 650. Each accelerator may be dedicated and programmed to perform a specific computation (such as multiplication, floating-point vector operations, or the like). By using accelerators, the average power consumed per computation per processor subunit may be reduced, and the computational throughput thereby increased. Accelerator 650 may be selected according to the application that the system is designed to implement (e.g., executing neural networks, executing database queries, or the like). Accelerator 650 may be configured by processing element 640 and may operate in tandem with the processing element to reduce power consumption and accelerate computations and calculations. The accelerators may additionally or alternatively be used, like smart direct memory access (DMA) peripherals, to transfer data between the memory of processing group 600 and the MUX/DEMUX/input/output ports (e.g., MUX 650 and DEMUX 660).

加速器650可被配置为执行多种功能。例如,一个加速器可被配置为执行常用于神经网络中的16比特浮点计算或8比特整数计算。加速器功能的另一示例为常用于神经网络的训练阶段期间的32比特浮点计算。加速器功能的又一示例为查询处理,诸如用于数据库中的查询处理。在一些实施例中,加速器650可包含用以执行这些功能的专门处理元件和/或可根据储存于存储器元件620上的配置数据而配置使得其可加以修改。Accelerator 650 may be configured to perform various functions. For example, an accelerator can be configured to perform 16-bit floating point computations or 8-bit integer computations commonly used in neural networks. Another example of accelerator functionality is the 32-bit floating point computations commonly used during the training phase of neural networks. Yet another example of accelerator functionality is query processing, such as for query processing in a database. In some embodiments, accelerator 650 may include specialized processing elements to perform these functions and/or may be configured such that it may be modified according to configuration data stored on memory element 620 .
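As a hedged illustration of the kind of reduced-precision arithmetic such an accelerator might be specialized for, the sketch below performs an 8-bit integer dot product with a simple symmetric quantization scale, a scheme commonly used for neural-network inference; the quantization details and function names are generic assumptions and are not taken from this disclosure:

# Illustrative sketch of 8-bit integer arithmetic of the sort an accelerator
# such as accelerator 650 might perform for neural-network inference.

def quantize(values, scale):
    """Map real values to signed 8-bit integers using a simple symmetric scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_dot(q_a, q_b, scale_a, scale_b):
    """Accumulate in wider precision, then rescale back to a real-valued result."""
    acc = sum(a * b for a, b in zip(q_a, q_b))  # 32-bit style accumulator
    return acc * scale_a * scale_b

a, b = [0.5, -1.25, 2.0], [1.0, 0.75, -0.5]
qa, qb = quantize(a, 0.02), quantize(b, 0.01)
print(int8_dot(qa, qb, 0.02, 0.01))  # close to the exact dot product of -1.4375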

加速器650可另外或替代地实施存储器移动的可配置的脚本处理列表以对数据至/从存储器620或至/从其他加速器和/或输入/输出的移动进行计时。因此,如下文进一步所解释,使用处理群组600的硬件芯片内部的所有数据移动可使用软件同步而非硬件同步。例如,一个处理群组(例如,群组600)中的加速器可每十个循环将数据从其输入端传送至其加速器,接着在下一个循环输出数据,藉此使信息从处理群组的存储器流送至另一存储器。Accelerator 650 may additionally or alternatively implement a configurable script processing list of memory moves to time the movement of data to/from memory 620 or to/from other accelerators and/or input/output. Therefore, as explained further below, all data movement within the hardware chips using processing group 600 may use software synchronization rather than hardware synchronization. For example, accelerators in one processing group (eg, group 600) may transfer data from its input to its accelerator every ten cycles, and then output data on the next cycle, thereby streaming information from the memory of the processing group to another memory.
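The software-timed movement described above can be sketched as a precomputed timetable of transfers, for example with the ten-cycle cadence mentioned; the scheduler name and port labels below are invented for the example:

# Illustrative sketch of software-timed data movement: each transfer is pinned
# to a cycle number ahead of time, so no hardware arbitration is needed.

def build_move_script(start_cycle, period, count, src, dst):
    """Return a list of (cycle, source, destination) moves, one every `period` cycles."""
    return [(start_cycle + i * period, src, dst) for i in range(count)]

script = build_move_script(start_cycle=0, period=10, count=4,
                           src="group600.input_port", dst="group600.accelerator")
for cycle, src, dst in script:
    print(f"cycle {cycle:3d}: move word from {src} to {dst}")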

如图6中进一步所描绘,在一些实施例中,处理群组600还可包含连接至其输入端口的至少一个输入多任务器(MUX)660及连接至其输出端口的至少一个输出DEMUX 670。这些MUX/DEMUX可由来自处理元件640和/或来自加速器650中的一个的控制信号(未图标)控制,该控制信号系根据正由处理元件640进行的当前指令和/或由加速器650中的加速器执行的操作而判定。在一些情境中,可能需要处理群组600(根据来自其代码存储器的预定义指令)将数据从其输入端口传送至其输出端口。因此,除DEMUX/MUX中的每个连接至处理元件640及加速器650以外,输入MUX(例如,MUX 660)中的一个或多个也可经由一个或多个总线直接连接至输出DEMUX(例如,DEMUX 670)。As further depicted in FIG. 6, in some embodiments, processing group 600 may further include at least one input multiplexer (MUX) 660 connected to its input port and at least one output DEMUX 670 connected to its output port. These MUXs/DEMUXs may be controlled by control signals (not shown) from processing element 640 and/or from one of accelerators 650, the control signals being determined according to the current instruction being carried out by processing element 640 and/or an operation being executed by one of accelerators 650. In some scenarios, processing group 600 may be required (according to predefined instructions from its code memory) to transfer data from its input port to its output port. Thus, in addition to each of the DEMUXs/MUXs being connected to processing element 640 and accelerators 650, one or more of the input MUXs (e.g., MUX 660) may also be connected directly to an output DEMUX (e.g., DEMUX 670) via one or more buses.

图6的处理群组600可排成阵列以形成分布式处理器,例如,如图7A中所描绘。处理群组可安置于基板710上以形成阵列。在一些实施例中,基板710可包含诸如硅的半导体基板。另外或替代地,基板710可包含电路板,诸如可挠性电路板。The processing group 600 of Figure 6 may be arranged in an array to form a distributed processor, eg, as depicted in Figure 7A. Processing groups may be disposed on substrate 710 to form an array. In some embodiments, the substrate 710 may comprise a semiconductor substrate such as silicon. Additionally or alternatively, the substrate 710 may comprise a circuit board, such as a flexible circuit board.

如图7A中所描绘,基板710可包括安置于其上的多个处理群组,诸如处理群组600。因此,基板710包括存储器阵列,该存储器阵列包括多个组,诸如组720a、720b、720c、720d、720e、720f、720g及720h。此外,基板710包括处理阵列,该处理阵列可包括多个处理器子单元,诸如子单元730a、730b、730c、730d、730e、730f、730g及730h。As depicted in FIG. 7A, substrate 710 may include multiple process groups, such as process group 600, disposed thereon. Thus, substrate 710 includes a memory array that includes a plurality of groups, such as groups 720a, 720b, 720c, 720d, 720e, 720f, 720g, and 720h. Further, substrate 710 includes a processing array, which may include a plurality of processor subunits, such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h.

此外,如上文所解释,每一处理群组可包括一处理器子单元及专用于该处理器子单元的一个或多个对应的存储器组。因此,如图7A中所描绘,每一子单元与一对应的专用存储器组相关联,例如:处理器子单元730a与存储器组720a相关联,处理器子单元730b与存储器组720b相关联,处理器子单元730c与存储器组720c相关联,处理器子单元730d与存储器组720d相关联,处理器子单元730e与存储器组720e相关联,处理器子单元730f与存储器组720f相关联,处理器子单元730g与存储器组720g相关联,处理器子单元730h与存储器组720h相关联。Furthermore, as explained above, each processing group may include a processor sub-unit and one or more corresponding memory banks dedicated to that processor sub-unit. Thus, as depicted in Figure 7A, each subunit is associated with a corresponding dedicated memory bank, eg, processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processing Processor subunit 730c is associated with memory bank 720c, processor subunit 730d is associated with memory bank 720d, processor subunit 730e is associated with memory bank 720e, processor subunit 730f is associated with memory bank 720f, and processor subunit 730f is associated with memory bank 720f. Unit 730g is associated with memory bank 720g, and processor sub-unit 730h is associated with memory bank 720h.

为了允许每一处理器子单元与其对应的专用存储器组通信,基板710可包括将处理器子单元中的一个连接至其对应的专用存储器组的第一多个总线。因此,总线740a将处理器子单元730a连接至存储器组720a,总线740b将处理器子单元730b连接至存储器组720b,总线740c将处理器子单元730c连接至存储器组720c,总线740d将处理器子单元730d连接至存储器组720d,总线740e将处理器子单元730e连接至存储器组720e,总线740f将处理器子单元730f连接至存储器组720f,总线740g将处理器子单元730g连接至存储器组720g,且总线740h将处理器子单元730h连接至存储器组720h。此外,为了允许每一处理器子单元与其他处理器子单元通信,基板710可包括将处理器子单元中的一个连接至处理器子单元中的另一个的第二多个总线。在图7A的示例中,总线750a将处理器子单元730a连接至处理器子单元750e,总线750b将处理器子单元730a连接至处理器子单元750b,总线750c将处理器子单元730b连接至处理器子单元750f,总线750d将处理器子单元730b连接至处理器子单元750c,总线750e将处理器子单元730c连接至处理器子单元750g,总线750f将处理器子单元730c连接至处理器子单元750d,总线750g将处理器子单元730d连接至处理器子单元750h,总线750h将处理器子单元730h连接至处理器子单元750g,总线750i将处理器子单元730g连接至处理器子单元750g,且总线750j将处理器子单元730f连接至处理器子单元750e。To allow each processor subunit to communicate with its corresponding dedicated memory bank, substrate 710 may include a first plurality of buses connecting one of the processor subunits to its corresponding dedicated memory bank. Thus, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730c to memory bank 720c, and bus 740d connects processor subunit 730c to memory bank 720c. unit 730d connects to memory bank 720d, bus 740e connects processor subunit 730e to memory bank 720e, bus 740f connects processor subunit 730f to memory bank 720f, and bus 740g connects processor subunit 730g to memory bank 720g, And bus 740h connects processor subunit 730h to memory bank 720h. Furthermore, to allow each processor subunit to communicate with other processor subunits, the substrate 710 may include a second plurality of buses connecting one of the processor subunits to another of the processor subunits. In the example of Figure 7A, bus 750a connects processor subunit 730a to processor subunit 750e, bus 750b connects processor subunit 730a to processor subunit 750b, and bus 750c connects processor subunit 730b to the processor subunit 750b processor subunit 750f, bus 750d connects processor subunit 730b to processor subunit 750c, bus 750e connects processor subunit 730c to processor subunit 750g, and bus 750f connects processor subunit 730c to processor subunit 750g Unit 750d, bus 750g connects processor subunit 730d to processor subunit 750h, bus 750h connects processor subunit 730h to processor subunit 750g, bus 750i connects processor subunit 730g to processor subunit 750g , and bus 750j connects processor subunit 730f to processor subunit 750e.

因此,在图7A中所展示的示例布置中,多个逻辑处理器子单元布置成至少一行及至少一列。该第二多个总线将每一处理器子单元连接至相同行中的至少一个邻近处理器子单元且连接至相同列中的至少一个邻近处理器子单元。图7A可被称作「部分块连接」。Thus, in the example arrangement shown in Figure 7A, a plurality of logical processor sub-units are arranged in at least one row and at least one column. The second plurality of buses connect each processor subunit to at least one adjacent processor subunit in the same row and to at least one adjacent processor subunit in the same column. Figure 7A may be referred to as a "partial block connection".

图7A中所展示的配置可经修改以形成「完全块连接」。完全块连接包括连接对角线处理器子单元的额外总线。例如,该第二多个总线可包括处理器子单元730a与处理器子单元730f之间、处理器子单元730b与处理器子单元730e之间、处理器子单元730b与处理器子单元730g之间、处理器子单元730c与处理器子单元730f之间、处理器子单元730c与处理器子单元730h之间以及处理器子单元730d与处理器子单元730g之间的额外总线。The configuration shown in Figure 7A can be modified to form "full block connections." A full block connection includes additional buses connecting diagonal processor subunits. For example, the second plurality of buses may include between processor subunit 730a and processor subunit 730f, between processor subunit 730b and processor subunit 730e, and between processor subunit 730b and processor subunit 730g There are additional buses between, between processor subunit 730c and processor subunit 730f, between processor subunit 730c and processor subunit 730h, and between processor subunit 730d and processor subunit 730g.
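For illustration, the helper below (its name and representation are assumptions made for the example) enumerates the second plurality of buses for a grid of processor subunits under either the partial block connection or the full block connection; for the 2 x 4 arrangement it yields the ten buses of the partial case plus six additional diagonal buses in the full case:

# Illustrative enumeration of inter-subunit buses for a rows x cols grid.

def block_connections(rows, cols, full=False):
    """Return a set of unordered pairs of (row, col) subunit coordinates to connect."""
    steps = [(0, 1), (1, 0)]
    if full:
        steps += [(1, 1), (1, -1)]            # diagonal buses of the full block connection
    buses = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in steps:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    buses.add(((r, c), (nr, nc)))
    return buses

partial = block_connections(2, 4)             # the 2 x 4 arrangement of FIG. 7A
full = block_connections(2, 4, full=True)
print(len(partial), len(full))                # 10 buses vs. 16 buses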

完全块连接可用于卷积计算,在卷积计算中,使用储存于附近处理器子单元中的数据及结果。例如,在卷积图像处理期间,每一处理器子单元可接收图像的块(诸如,像素或像素群组)。为了计算卷积结果,每一处理器子单元可从所有八个邻近处理器子单元获取数据,该邻近处理器子单元中的每个已接收对应块。在部分块连接中,来自对角线邻近处理器子单元的数据可经由连接至该处理器子单元的其他邻近处理器子单元传递。因此,芯片上的分布式处理器可为人工智能加速器处理器。Full block connections can be used for convolution computations, where data and results stored in nearby processor subunits are used. For example, during convolutional image processing, each processor subunit may receive a block of an image (such as a pixel or group of pixels). To compute the convolution result, each processor subunit may obtain data from all eight adjacent processor subunits, each of which has received a corresponding block. In a partial block connection, data from a diagonally adjacent processor subunit may be passed via other adjacent processor subunits connected to that processor subunit. Therefore, the distributed processor on the chip can be an artificial intelligence accelerator processor.

在卷积计算的特定实施例中,可跨越多个处理器子单元来划分N×M图像。每一处理器子单元可在其对应块上通过A×B滤波器执行卷积。为了对块之间的边界上的一个或多个像素执行滤波,每一处理器子单元可能需要来自相邻处理器子单元的数据,该相邻处理器子单元具有包括相同边界上的像素的块。因此,针对每一处理器子单元产生的代码配置该子单元以计算卷积且每当需要来自邻近子单元的数据时从第二多个总线提取。将数据输出至第二多个总线的对应命令被提供至该子单元以确保所需数据传送的适当时序。In certain embodiments of convolution computations, the NxM image may be partitioned across multiple processor subunits. Each processor sub-unit may perform convolution with an AxB filter on its corresponding block. In order to perform filtering on one or more pixels on the boundaries between blocks, each processor sub-unit may require data from adjacent processor sub-units with data that includes pixels on the same boundary piece. Thus, the code generated for each processor subunit configures that subunit to compute convolutions and fetch from the second plurality of buses whenever data from adjacent subunits is needed. Corresponding commands to output data to the second plurality of buses are provided to the subunit to ensure proper timing of the required data transfers.
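A compact software sketch of this boundary exchange follows; NumPy, the helper names, and the column-wise split are choices made only for the example and are not the disclosed implementation:

# Illustrative sketch: an N x M image is split column-wise across subunits, and
# each subunit fetches one boundary column from its left and right neighbors
# before filtering with a 3 x 3 kernel.

import numpy as np

def split_with_halo(image, parts):
    """Return per-subunit tiles, each padded with the neighbor columns it needs."""
    tiles = np.array_split(image, parts, axis=1)
    padded = []
    for i, tile in enumerate(tiles):
        left = tiles[i - 1][:, -1:] if i > 0 else np.zeros((image.shape[0], 1))
        right = tiles[i + 1][:, :1] if i < parts - 1 else np.zeros((image.shape[0], 1))
        padded.append(np.hstack([left, tile, right]))   # data obtained over the second plurality of buses
    return padded

def conv3x3_valid(tile, kernel):
    out = np.zeros((tile.shape[0] - 2, tile.shape[1] - 2))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(tile[r:r + 3, c:c + 3] * kernel)
    return out

image = np.arange(48, dtype=float).reshape(6, 8)
kernel = np.ones((3, 3)) / 9.0
tiles = split_with_halo(image, parts=4)
results = [conv3x3_valid(t, kernel) for t in tiles]    # each subunit filters its own tile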

图7A的部分块连接可修改为N部分块连接。在此修改中,第二多个总线可进一步将每一处理器子单元连接至在图7A的总线延其所沿的四个方向(也即,上、下、左及右)上在该处理器子单元的阈值距离内(例如,在n个处理器子单元内)的处理器子单元。可对完全块连接进行类似修改(以产生N完全块连接),使得第二多个总线进一步将每一处理器子单元连接至在除两个对角线方向以外的图7A的总线延其所沿的四个方向上在该处理器子单元的阈值距离内(例如,在n个处理器子单元内)的处理器子单元。The partial block connections of FIG. 7A can be modified to N partial block connections. In this modification, the second plurality of buses may further connect each processor subunit to the processing in the four directions (ie, up, down, left, and right) along which the bus of FIG. 7A extends. processor subunits within a threshold distance of the processor subunits (eg, within n processor subunits). A similar modification can be made to the full block connections (to produce N full block connections), such that the second plurality of buses further connect each processor subunit to the bus of FIG. 7A in directions other than two diagonal directions. Processor subunits that are within a threshold distance of that processor subunit (eg, within n processor subunits) in four directions along.

Other arrangements are also possible. For example, in the arrangement shown in FIG. 7B, bus 750a connects processor subunit 730a to processor subunit 730d, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730c, and bus 750d connects processor subunit 730c to processor subunit 730d. Accordingly, in the example arrangement shown in FIG. 7B, the plurality of processor subunits is arranged in a star pattern. The second plurality of buses connects each processor subunit to at least one adjacent processor subunit within the star pattern.

Further arrangements (not shown) are also possible. For example, a neighbor connection arrangement may be used such that the plurality of processor subunits is arranged in one or more lines (e.g., similar to the arrangement depicted in FIG. 7A). In a neighbor connection arrangement, the second plurality of buses connects each processor subunit to the processor subunit to its left in the same line, to the processor subunit to its right in the same line, to the processor subunits both to its left and to its right in the same line, and so forth.

In another embodiment, an N-linear connection arrangement may be used. In an N-linear connection arrangement, the second plurality of buses connects each processor subunit to processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits). The N-linear connection arrangement may be used with a line array (described above), a rectangular array (depicted in FIG. 7A), an elliptical array (depicted in FIG. 7B), or any other geometric array.

In yet another embodiment, an N-logarithmic connection arrangement may be used. In an N-logarithmic connection arrangement, the second plurality of buses connects each processor subunit to processor subunits within a power-of-two threshold distance of that processor subunit (e.g., within 2^n processor subunits). The N-logarithmic connection arrangement may be used with a line array (described above), a rectangular array (depicted in FIG. 7A), an elliptical array (depicted in FIG. 7B), or any other geometric array.
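The two arrangements differ only in which neighbor offsets receive buses. The following Python sketch, which is not part of the disclosure, enumerates the neighbor sets for a one-dimensional line array; it assumes the N-logarithmic arrangement places a bus at every power-of-two offset up to 2^n (the disclosure does not spell out whether only the offset 2^n itself is connected), and the function names are invented.

```python
def n_linear_neighbors(index, count, n):
    """Subunits within n positions of `index` in a line array (N-linear connection)."""
    return [j for j in range(max(0, index - n), min(count, index + n + 1)) if j != index]

def n_log_neighbors(index, count, n):
    """Subunits at power-of-two offsets up to 2**n from `index` (N-logarithmic connection)."""
    neighbors = set()
    for k in range(n + 1):
        for j in (index - 2 ** k, index + 2 ** k):
            if 0 <= j < count:
                neighbors.add(j)
    return sorted(neighbors)

# Example: 16 subunits in a line, looking at subunit 5.
print(n_linear_neighbors(5, 16, 2))   # [3, 4, 6, 7]
print(n_log_neighbors(5, 16, 3))      # [1, 3, 4, 6, 7, 9, 13]
```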

Any of the connection schemes described above may be combined for use in the same hardware chip. For example, a full block connection may be used in one region while a partial block connection is used in another region. In another embodiment, an N-linear connection arrangement may be used in one region while an N-full block connection is used in another region.

Instead of, or in addition to, dedicated buses between the processor subunits of the memory chip, one or more shared buses may be used to interconnect all of the processor subunits (or a subset of the processor subunits) of the distributed processor. Conflicts on the shared buses may still be avoided by using code executed by the processor subunits to time data transfers on the shared buses, as explained further below. Instead of, or in addition to, shared buses, configurable buses may be used to dynamically connect processor subunits in order to form groups of processor units connected to separate buses. For example, the configurable buses may include transistors or other mechanisms that may be controlled by the processor subunits to direct data transfers to selected processor subunits.

In both FIG. 7A and FIG. 7B, the plurality of processor subunits of the processing array is spatially distributed among the plurality of discrete memory banks of the memory array. In other alternative embodiments (not shown), the plurality of processor subunits may be clustered in one or more regions of the substrate, and the plurality of memory banks may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering (not shown) may be used. For example, one region of the substrate may include a cluster of processor subunits, another region of the substrate may include a cluster of memory banks, and yet another region of the substrate may include processing arrays distributed among memory banks.

Those skilled in the art will recognize that arranging processing group 600 in an array on a substrate is not an exclusive embodiment. For example, each processor subunit may be associated with at least two dedicated memory banks. Accordingly, processing groups 310a, 310b, 310c, and 310d of FIG. 3B may be used instead of, or in combination with, processing group 600 to form the processing array and the memory array. Other processing groups (not shown) including, for example, three, four, or more than four dedicated memory banks may be used.

Each of the plurality of processor subunits may be configured to execute software code associated with a particular application independently of the other processor subunits included in the plurality of processor subunits. For example, as explained below, a plurality of sub-series of instructions may be grouped as machine code and provided to each processor subunit for execution.

In some embodiments, each dedicated memory bank comprises at least one dynamic random access memory (DRAM). Alternatively, the memory banks may comprise a mix of memory types, such as static random access memory (SRAM), DRAM, flash memory, or the like.

In conventional processors, data sharing between processor subunits is typically performed through a shared memory. Shared memory generally requires a large portion of the chip area and/or a bus managed by additional hardware (such as an arbiter). As described above, that bus creates a bottleneck. In addition, shared memory, which may be external to the chip, typically includes cache coherency mechanisms and more complex caches (e.g., an L1 cache, an L2 cache, and shared DRAM) in order to provide accurate and up-to-date data to the processor subunits. As explained further below, the dedicated buses depicted in FIG. 7A and FIG. 7B allow a hardware chip that is free of hardware management (such as arbiters). Moreover, the use of dedicated memories as depicted in FIG. 7A and FIG. 7B allows the elimination of complex cache layers and coherency mechanisms.

Instead, in order to allow each processor subunit to access data computed by other processor subunits and/or stored in memory banks dedicated to other processor subunits, buses are provided whose timing is performed dynamically using code executed individually by each processor subunit. This allows the elimination of most, if not all, of the bus management hardware as conventionally used. Moreover, direct transfers over these buses replace complex caching mechanisms and reduce latency during memory reads and writes.

Memory-Based Processing Array

As depicted in FIG. 7A and FIG. 7B, the memory chips of the present disclosure may operate independently. Alternatively, the memory chips of the present disclosure may be operably connected to one or more additional integrated circuits, such as a memory device (e.g., one or more DRAM banks), a system-on-chip, a field-programmable gate array (FPGA), or another processing and/or memory chip. In these embodiments, the tasks in a series of instructions executed by the architecture may be divided (e.g., by a compiler, as described below) between the processor subunits of the memory chip and any processor subunits of the additional integrated circuits. For example, the other integrated circuits may comprise a host (e.g., host 350 of FIG. 3A) that inputs instructions and/or data to the memory chip and receives output from it.

To interconnect a memory chip of the present disclosure with one or more additional integrated circuits, the memory chip may include a memory interface, such as a memory interface complying with any of the Joint Electron Device Engineering Council (JEDEC) standards or variants thereof. The one or more additional integrated circuits may then connect to the memory interface. Accordingly, if the one or more additional integrated circuits are connected to a plurality of memory chips of the present disclosure, data may be shared between the memory chips via the one or more additional integrated circuits. Additionally or alternatively, the one or more additional integrated circuits may include buses to connect to buses on the memory chip of the present disclosure, such that the one or more additional integrated circuits may execute code in tandem with the memory chip of the present disclosure. In these embodiments, the one or more additional integrated circuits further assist with distributed processing, even though the additional integrated circuits may be on different substrates from the memory chip of the present disclosure.

Furthermore, the memory chips of the present disclosure may be arrayed in order to form an array of distributed processors. For example, one or more buses may connect memory chip 770a to an additional memory chip 770b, as depicted in FIG. 7C. In the embodiment of FIG. 7C, memory chip 770a includes processor subunits with one or more corresponding memory banks dedicated to each processor subunit: processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730e is associated with memory bank 720c, and processor subunit 730f is associated with memory bank 720d. Buses connect each processor subunit to its corresponding memory bank. Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730e to memory bank 720c, and bus 740d connects processor subunit 730f to memory bank 720d. Furthermore, bus 750a connects processor subunit 730a to processor subunit 750e, bus 750b connects processor subunit 730a to processor subunit 750b, bus 750c connects processor subunit 730b to processor subunit 750f, and bus 750d connects processor subunit 730e to processor subunit 750f. Other arrangements of memory chip 770a may be used, for example, as described above.

Similarly, memory chip 770b includes processor subunits with one or more corresponding memory banks dedicated to each processor subunit: processor subunit 730c is associated with memory bank 720e, processor subunit 730d is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h. Buses connect each processor subunit to its corresponding memory bank. Accordingly, bus 740e connects processor subunit 730c to memory bank 720e, bus 740f connects processor subunit 730d to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Furthermore, bus 750g connects processor subunit 730c to processor subunit 750g, bus 750h connects processor subunit 730d to processor subunit 750h, bus 750i connects processor subunit 730c to processor subunit 750d, and bus 750j connects processor subunit 730g to processor subunit 750h. Other arrangements of memory chip 770b may be used, for example, as described above.

The processor subunits of memory chips 770a and 770b may be connected using one or more buses. Accordingly, in the embodiment of FIG. 7C, bus 750e may connect processor subunit 730b of memory chip 770a with processor subunit 730c of memory chip 770b, and bus 750f may connect processor subunit 730f of memory chip 770a with processor subunit 730c of memory chip 770b. For example, bus 750e may serve as an input bus to memory chip 770b (and thus as an output bus of memory chip 770a), while bus 750f may serve as an input bus to memory chip 770a (and thus as an output bus of memory chip 770b), or vice versa. Alternatively, buses 750e and 750f may both serve as bidirectional buses between memory chips 770a and 770b.

Buses 750e and 750f may include direct wires or may be interleaved over a high-speed connection in order to reduce the number of pins required for the inter-chip interface between memory chip 770a and integrated circuit 770b. Furthermore, any of the connection arrangements described above for use within the memory chip itself may be used to connect the memory chip to one or more additional integrated circuits. For example, memory chips 770a and 770b may be connected using a full block connection or a partial block connection rather than using only two buses as shown in FIG. 7C.

Accordingly, although depicted using buses 750e and 750f, architecture 760 may include fewer buses or additional buses. For example, a single bus between processor subunits 730b and 730c or between processor subunits 730f and 730c may be used. Alternatively, additional buses may be used, for example, between processor subunits 730b and 730d, between processor subunits 730f and 730d, or the like.

Furthermore, although depicted as using a single memory chip and one additional integrated circuit, a plurality of memory chips may be connected using buses as explained above. For example, as depicted in the embodiment of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in an array. Similar to the memory chips described above, each memory chip includes processor subunits and dedicated memory banks. A description of these components is therefore not repeated here.

In the embodiment of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in a loop. Accordingly, bus 750a connects memory chips 770a and 770d, bus 750c connects memory chips 770a and 770b, bus 750e connects memory chips 770b and 770c, and bus 750g connects memory chips 770c and 770d. Although memory chips 770a, 770b, 770c, and 770d could be connected with full block connections, partial block connections, or other connection arrangements, the embodiment of FIG. 7C allows for fewer pin connections between memory chips 770a, 770b, 770c, and 770d.

Relatively Large Memory

Embodiments of the present disclosure may use dedicated memories that are relatively large in size compared with the shared memories of conventional processors. Using dedicated memories rather than shared memories allows the efficiency gains to continue without tapering off as the memory grows. This allows memory-intensive tasks, such as neural network processing and database queries, to be performed more efficiently than in conventional processors, in which the efficiency gains of increasing shared memory taper off due to the von Neumann bottleneck.

For example, in a distributed processor of the present disclosure, a memory array disposed on the substrate of the distributed processor may include a plurality of discrete memory banks. Each of the discrete memory banks may have a capacity greater than one megabyte; a processing array disposed on the substrate may include a plurality of processor subunits. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. In some embodiments, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. By using at least one megabyte of dedicated memory rather than a few megabytes of cache shared by a large CPU or GPU, the distributed processors of the present disclosure obtain efficiencies that are not possible in conventional systems because of the von Neumann bottleneck in CPUs and GPUs.

Different memories may be used as the dedicated memories. For example, each dedicated memory bank may comprise at least one DRAM bank. Alternatively, each dedicated memory bank may comprise at least one static random access memory bank. In other embodiments, different types of memory may be combined on a single hardware chip.

As explained above, each dedicated memory may be at least one megabyte. Accordingly, each dedicated memory bank may be of the same size, or at least two of the plurality of memory banks may have different sizes.

Furthermore, as described above, the distributed processor may include a first plurality of buses, each connecting one of the plurality of processor subunits to a corresponding dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.

Synchronization Using Software

As explained above, the hardware chips of the present disclosure may manage data transfers using software rather than hardware. Specifically, because the timing of transfers on the buses, of reads from and writes to the memories, and of the computations of the processor subunits is set by the sub-series of instructions executed by the processor subunits, the hardware chips of the present disclosure may execute code that prevents conflicts on the buses. Accordingly, the hardware chips of the present disclosure may avoid the hardware mechanisms conventionally used to manage data transfers (such as network controllers within the chip, packet parsers and packet transmitters between the processor subunits, bus arbiters, multiple buses used to avoid arbitration, or the like).

If the hardware chips of the present disclosure transferred data conventionally, connecting N processor subunits with a bus would require bus arbitration controlled by an arbiter or a wide MUX. Instead, as described above, embodiments of the present disclosure may use buses between the processor subunits that are merely wires, optical cables, or the like, with the processor subunits individually executing code to avoid conflicts on the buses. Accordingly, embodiments of the present disclosure may save space on the substrate as well as material costs and efficiency losses (e.g., power and time consumed by arbitration). The efficiency and space gains are even greater when compared with other architectures that use first-in, first-out (FIFO) controllers and/or mailboxes.

Furthermore, as explained above, each processor subunit may include one or more accelerators in addition to its one or more processing elements. In some embodiments, the accelerators, rather than the processing elements, may perform the reads from and writes to the buses. In these embodiments, additional efficiency may be obtained by allowing the accelerators to transfer data during the same cycle in which the processing elements perform one or more computations. However, these embodiments require additional materials for the accelerators. For example, additional transistors may be required to fabricate the accelerators.

The code may also account for the internal behavior of the processor subunits (e.g., including the processing elements and/or accelerators forming part of each processor subunit), including timing and latencies. For example, a compiler (as described below) may perform pre-processing that accounts for timing and latencies when generating the sub-series of instructions that control the data transfers.

In one embodiment, a plurality of processor subunits may be assigned the task of computing a neural network layer containing a plurality of neurons that are all connected to a previous layer containing a larger plurality of neurons. Assuming the data of the previous layer is spread evenly among the plurality of processor subunits, one way to perform the computation may be to configure each processor subunit to transmit the data of the previous layer to the main bus in turn, with each processor subunit then multiplying that data by the weights of the corresponding neurons implemented by that subunit. Because each processor subunit computes more than one neuron, each processor subunit will transmit the data of the previous layer a number of times equal to the number of neurons. Accordingly, the code of each processor subunit differs from the code of the other processor subunits because the subunits transmit at different times.
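A small simulation can make the turn-taking concrete. The Python sketch below is illustrative only and not part of the disclosure; it assumes four subunits, evenly split activations and weight rows, and a single shared main bus driven by one sender per turn, with all sizes and names invented for the example.

```python
import numpy as np

SUBUNITS = 4
PREV = 8          # neurons in the previous layer
CURR = 8          # neurons in the current layer (two per subunit)

rng = np.random.default_rng(0)
activations = rng.normal(size=PREV)                 # previous-layer outputs
weights = rng.normal(size=(CURR, PREV))             # full weight matrix

# Previous-layer data is spread evenly among the subunits.
local_slices = np.split(activations, SUBUNITS)
# Each subunit also holds the weight rows of the neurons it implements.
local_weights = np.split(weights, SUBUNITS, axis=0)

accumulators = [np.zeros(CURR // SUBUNITS) for _ in range(SUBUNITS)]

# Each subunit takes a turn driving the main bus with its slice of the
# previous layer; in that same turn every subunit (including the sender)
# accumulates the partial dot products for its own neurons.
for sender in range(SUBUNITS):
    bus = local_slices[sender]
    cols = slice(sender * (PREV // SUBUNITS), (sender + 1) * (PREV // SUBUNITS))
    for receiver in range(SUBUNITS):
        accumulators[receiver] += local_weights[receiver][:, cols] @ bus

layer_output = np.concatenate(accumulators)
assert np.allclose(layer_output, weights @ activations)
```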

In some embodiments, a distributed processor may comprise a substrate (e.g., a semiconductor substrate such as silicon and/or a circuit board such as a flexible circuit board) having a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks, and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, for example, in FIG. 7A and FIG. 7B. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. Furthermore, as depicted in, for example, FIG. 7A and FIG. 7B, the distributed processor may also comprise a plurality of buses, each of the plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits.

As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may be free of timing hardware logic components, such that data transfers between the processor subunits and across corresponding ones of the plurality of buses are uncontrolled by timing hardware logic components. In one embodiment, the plurality of buses may be free of bus arbiters, such that data transfers between the processor subunits and across corresponding ones of the plurality of buses are uncontrolled by bus arbiters.

In some embodiments, as depicted, for example, in FIG. 7A and FIG. 7B, the distributed processor may further comprise a second plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank. Similar to the plurality of buses described above, the second plurality of buses may be free of timing hardware logic components, such that data transfers between the processor subunits and the corresponding dedicated memory banks are uncontrolled by timing hardware logic components. In one embodiment, the second plurality of buses may be free of bus arbiters, such that data transfers between the processor subunits and the corresponding dedicated memory banks are uncontrolled by bus arbiters.

As used herein, the phrase "free of" does not necessarily imply the absolute absence of components such as timing hardware logic components (e.g., bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like). Such components may still be included in a hardware chip described as "free of" those components. Instead, the phrase "free of" refers to the function of the hardware chip; that is, a hardware chip "free of" timing hardware logic components controls the timing of its data transfers without using the timing hardware logic components (if any) included therein. For example, a hardware chip executes code comprising sub-series of instructions that control data transfers between the processor subunits of the hardware chip, even if the hardware chip includes timing hardware logic components as a secondary precaution against conflicts caused by errors in the executed code.

As explained above, the plurality of buses may comprise at least one of wires or optical fibers between corresponding ones of the plurality of processor subunits. Accordingly, in one embodiment, a distributed processor free of timing hardware logic components may include only wires or optical fibers, without bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like.

In some embodiments, the plurality of processor subunits is configured to transfer data across at least one of the plurality of buses in accordance with code executed by the plurality of processor subunits. Accordingly, as explained below, a compiler may organize sub-series of instructions, each sub-series comprising code to be executed by a single processor subunit. The sub-series of instructions may instruct the processor subunit when to transfer data onto one of the buses and when to retrieve data from the buses. When the sub-series are executed in tandem across the distributed processor, the timing of transfers between the processor subunits may be governed by the transfer and retrieval instructions included in the sub-series. The code thus dictates the timing of data transfers across at least one of the plurality of buses. The compiler may generate code to be executed by a single processor subunit. In addition, the compiler may generate code to be executed by groups of processor subunits. In some cases, the compiler may treat all of the processor subunits collectively as one super-processor (e.g., a distributed processor), and the compiler may generate code for execution by the super-processor/distributed processor so defined.
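One way to picture such a compiler pass is as the emission of cycle-aligned instruction lists in which every SEND in one subunit's sub-series has a matching RECEIVE at the same cycle in another subunit's sub-series. The Python sketch below is a simplified illustration under that assumption, not the disclosed compiler; the instruction mnemonics, the NOP padding strategy, and the data structures are invented.

```python
from dataclasses import dataclass, field

@dataclass
class SubunitProgram:
    """Instruction sub-series for one processor subunit, one entry per cycle."""
    instructions: list = field(default_factory=list)

def pad(program, cycle):
    """Fill idle cycles with NOPs (or local work) so cycle counts line up."""
    while len(program.instructions) < cycle:
        program.instructions.append("NOP")

def schedule_transfer(programs, sender, receiver, value_name, cycle):
    """Emit a matched SEND/RECEIVE pair so the bus is driven and sampled in
    the same cycle, with no runtime arbitration required."""
    pad(programs[sender], cycle)
    pad(programs[receiver], cycle)
    programs[sender].instructions.append(f"SEND {value_name} -> bus")
    programs[receiver].instructions.append(f"RECEIVE {value_name} <- bus")

# Toy compilation: subunit 0 computes a value during cycles 0-1, ships it to
# subunit 1 at cycle 2, and subunit 1 consumes it at cycle 3.
programs = [SubunitProgram() for _ in range(2)]
programs[0].instructions += ["LOAD x", "MAC x, w0"]
schedule_transfer(programs, sender=0, receiver=1, value_name="acc0", cycle=2)
programs[1].instructions.append("MAC acc0, w1")

for i, p in enumerate(programs):
    print(f"subunit {i}: {p.instructions}")
```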

As explained above and as depicted in FIG. 7A and FIG. 7B, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. Alternatively, the plurality of processor subunits may be clustered in one or more regions of the substrate, and the plurality of memory banks may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering may be used, as explained above.

In some embodiments, a distributed processor may comprise a substrate (e.g., a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) having a memory array disposed thereon, the memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, for example, in FIG. 7A and FIG. 7B. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. Furthermore, as depicted in, for example, FIG. 7A and FIG. 7B, the distributed processor may also comprise a plurality of buses, each of the plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank of the plurality of discrete memory banks.

As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may be free of timing hardware logic components, such that data transfers between a processor subunit and its corresponding dedicated discrete memory bank of the plurality of discrete memory banks, and across corresponding ones of the plurality of buses, are uncontrolled by timing hardware logic components. In one embodiment, the plurality of buses may be free of bus arbiters, such that data transfers between the processor subunits and across corresponding ones of the plurality of buses are uncontrolled by bus arbiters.

In some embodiments, as depicted, for example, in FIG. 7A and FIG. 7B, the distributed processor may further comprise a second plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Similar to the plurality of buses described above, the second plurality of buses may be free of timing hardware logic components, such that data transfers between the processor subunits and the corresponding dedicated memory banks are uncontrolled by timing hardware logic components. In one embodiment, the second plurality of buses may be free of bus arbiters, such that data transfers between the processor subunits and the corresponding dedicated memory banks are uncontrolled by bus arbiters.

In some embodiments, the distributed processor may use a combination of software timing components and hardware timing components. For example, a distributed processor may comprise a substrate (e.g., a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) having a memory array disposed thereon, the memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, for example, in FIG. 7A and FIG. 7B. As explained above, each of the processor subunits may be associated with a corresponding dedicated memory bank of the plurality of discrete memory banks. Furthermore, as depicted in, for example, FIG. 7A and FIG. 7B, the distributed processor may also comprise a plurality of buses, each of the plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Furthermore, as explained above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers across the plurality of buses in order to avoid colliding data transfers on at least one of the plurality of buses. In this embodiment, the software may control the timing of the data transfers, but the transfers themselves may be controlled, at least in part, by one or more hardware components.

In these embodiments, the distributed processor may further comprise a second plurality of buses connecting one of the plurality of processor subunits to a corresponding dedicated memory bank. Similar to the plurality of buses described above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers across the second plurality of buses in order to avoid colliding data transfers on at least one of the second plurality of buses. In this embodiment, as explained above, the software may control the timing of the data transfers, but the transfers themselves may be controlled, at least in part, by one or more hardware components.

Division of Code

As explained above, the hardware chips of the present disclosure may execute code in parallel across the processor subunits included on the substrate forming the hardware chip. In addition, the hardware chips of the present disclosure may perform multitasking. For example, the hardware chips of the present disclosure may perform area multitasking, in which one group of processor subunits of the hardware chip executes one task (e.g., audio processing) while another group of processor subunits of the hardware chip executes another task (e.g., image processing). In another embodiment, the hardware chips of the present disclosure may perform timing multitasking, in which one or more processor subunits of the hardware chip execute one task during a first time period and another task during a second time period. A combination of area multitasking and timing multitasking may also be used, such that one task may be assigned to a first group of processor subunits during a first time period while another task is assigned to a second group of processor subunits during the first time period, after which a third task may be assigned to the processor subunits included in the first group and the second group during a second time period.
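The combination of area and timing multitasking can be thought of as a table mapping (time period, group of subunits) to a task. The Python sketch below is purely illustrative and not from the disclosure; the group sizes, task names, and lookup helper are invented.

```python
# A toy schedule combining area and timing multitasking: groups of subunit
# indices are mapped to tasks per time period.
GROUP_A = [0, 1, 2, 3]        # e.g., runs audio processing during period 0
GROUP_B = [4, 5, 6, 7]        # e.g., runs image processing during period 0

schedule = {
    0: {tuple(GROUP_A): "audio_processing", tuple(GROUP_B): "image_processing"},
    1: {tuple(GROUP_A + GROUP_B): "neural_network_inference"},
}

def task_for(subunit, period):
    """Return the task a given subunit runs during a given time period."""
    for group, task in schedule[period].items():
        if subunit in group:
            return task
    return "idle"

assert task_for(2, 0) == "audio_processing"
assert task_for(5, 1) == "neural_network_inference"
```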

To organize the machine code for execution on the memory chips of the present disclosure, the machine code may be divided among the processor subunits of the memory chip. For example, a processor on a memory chip may comprise a substrate and a plurality of processor subunits disposed on the substrate. The memory chip may further comprise a corresponding plurality of memory banks disposed on the substrate, each of the plurality of processor subunits being connected to at least one dedicated memory bank that is not shared by any other processor subunit of the plurality of processor subunits. Each processor subunit on the memory chip may be configured to execute a series of instructions independently of the other processor subunits. Each series of instructions may be executed by configuring one or more general processing elements of the processor subunit in accordance with the code defining the series of instructions and/or by activating one or more special processing elements (e.g., one or more accelerators) of the processor subunit in accordance with a sequence provided in the code defining the series of instructions.

Accordingly, each series of instructions may define a series of tasks to be performed by a single processor subunit. A single task may comprise an instruction within the instruction set defined by the architecture of the one or more processing elements in the processor subunit. For example, the processor subunit may include particular registers, and a single task may push data onto a register, pull data from a register, perform an arithmetic function on data within a register, perform a logic operation on data within a register, or the like. Moreover, the processor subunits may be configured for any number of operands, such as a 0-operand processor subunit (also referred to as a "stack machine"), a 1-operand processor subunit (also referred to as an accumulator machine), a 2-operand processor subunit (such as a RISC), a 3-operand processor subunit (such as a complex instruction set computer (CISC)), or the like. In another embodiment, the processor subunit may include one or more accelerators, and a single task may activate an accelerator to perform a specific function, such as a MAC function, a MAX function, a MAX-0 function, or the like.
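The accelerator functions named above can be pictured as a small dispatch table: a task simply names the accelerator to start and supplies its operands. The Python sketch below is a hedged illustration; the operand widths, semantics, and dispatcher are assumptions and not the disclosed micro-architecture.

```python
# Illustrative accelerator primitives named in the text; exact semantics are assumed.
def mac(acc, a, b):
    """Multiply-accumulate: acc + a * b."""
    return acc + a * b

def vmax(a, b):
    """Scalar MAX of two operands."""
    return a if a >= b else b

def max0(a):
    """MAX-0, i.e., clamp the operand at zero (ReLU-style)."""
    return a if a > 0 else 0

ACCELERATORS = {"MAC": mac, "MAX": vmax, "MAX0": max0}

def run_task(task):
    """A single task: activate the named accelerator with its operands."""
    name, *operands = task
    return ACCELERATORS[name](*operands)

assert run_task(("MAC", 10, 2, 3)) == 16
assert run_task(("MAX", -1, 4)) == 4
assert run_task(("MAX0", -7)) == 0
```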

The series of instructions may also include tasks for reading from and writing to the dedicated memory banks of the memory chip. For example, a task may include writing a piece of data to the memory bank dedicated to the processor subunit executing the task, reading a piece of data from the memory bank dedicated to the processor subunit executing the task, or the like. In some embodiments, the reading and writing may be performed by the processor subunit in tandem with a controller of the memory bank. For example, the processor subunit may perform a read or write task by sending a control signal to the controller to perform the read or write. In some embodiments, the control signal may include a specific address for the read or write. Alternatively, the processor subunit may defer to the memory controller to select an address available for the read or write.

Additionally or alternatively, the reading and writing may be performed by one or more accelerators in tandem with the controller of the memory bank. For example, the accelerators may generate the control signals for the memory controller, similar to how the processor subunit generates control signals, as described above.

In any of the embodiments described above, an address generator may also be used to direct the reads and writes to specific addresses of the memory bank. For example, the address generator may comprise a processing element configured to generate the memory addresses for the reads and writes. The address generator may be configured to generate addresses in a way that increases efficiency, for example, by writing the results of a later computation to the same addresses as the results of an earlier computation that are no longer needed. Accordingly, the address generator may generate the control signals for the memory controller either in response to a command from the processor subunit (e.g., from a processing element included therein or from one or more accelerators therein) or in tandem with the processor subunit. Additionally or alternatively, the address generator may generate the addresses based on some configuration or registers, for example, generating a nested loop structure in order to iterate over certain addresses in the memory in a certain pattern.
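The nested-loop pattern mentioned above can be modeled as a configuration of trip counts and address strides. The Python sketch below is illustrative only; the (base, counts, strides) parameterization is an assumption and does not reflect the disclosed register layout.

```python
def nested_loop_addresses(base, counts, strides):
    """Generate addresses from a nested loop configuration: counts[i] is the
    trip count of loop level i and strides[i] is its address step."""
    def walk(level, offset):
        if level == len(counts):
            yield base + offset
            return
        for i in range(counts[level]):
            yield from walk(level + 1, offset + i * strides[level])
    yield from walk(0, 0)

# Example: iterate over a 4 x 3 tile stored row-major at base address 0x100,
# visiting every element of a row before moving to the next row.
addresses = list(nested_loop_addresses(0x100, counts=[4, 3], strides=[16, 4]))
print([hex(a) for a in addresses])
# ['0x100', '0x104', '0x108', '0x110', '0x114', '0x118', ...]
```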

In some embodiments, each series of instructions may comprise a set of machine code defining a corresponding series of tasks. Accordingly, the series of tasks described above may be encapsulated within the machine code comprising the series of instructions. In some embodiments, as explained below with respect to FIG. 8, the series of tasks may be defined by a compiler configured to distribute a higher-level series of tasks among the plurality of logic circuits as a plurality of series of tasks. For example, the compiler may generate the plurality of series of tasks based on the higher-level series of tasks, such that the processor subunits, each executing its corresponding series of tasks in tandem with the others, perform the same function as outlined by the higher-level series of tasks.

As explained further below, the higher-level series of tasks may comprise a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise lower-level series of tasks, each of which comprises a set of instructions written in machine code.

As explained above with respect to FIG. 7A and FIG. 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Furthermore, as explained above, the data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across at least one of the plurality of buses may be predefined by the series of instructions included in a processor subunit connected to the at least one of the plurality of buses. Accordingly, one of the tasks included in the series of instructions may include outputting data to, or pulling data from, one of the buses. Such tasks may be executed by a processing element of the processor subunit or by one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunit may perform a computation or send a control signal to a corresponding memory bank in the same cycle during which the accelerator pulls data from, or places data on, one of the buses.

In one embodiment, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a sending task comprising a command for the processor subunit connected to the at least one of the plurality of buses to write data to the at least one of the plurality of buses. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receiving task comprising a command for the processor subunit connected to the at least one of the plurality of buses to read data from the at least one of the plurality of buses.

In addition to, or instead of, distributing code among the processor subunits, data may be divided among the memory banks of the memory chip. For example, as explained above, a distributed processor on a memory chip may comprise a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of memory banks may be configured to store data independent of the data stored in the others of the plurality of memory banks, and one of the plurality of processor subunits may be connected to at least one dedicated memory bank among the plurality of memory banks. For example, each processor subunit may have access to one or more memory controllers of the one or more corresponding memory banks dedicated to that processor subunit, while no other processor subunit may access those corresponding one or more memory controllers. Accordingly, the data stored in each memory bank may be unique to the dedicated processor subunit. Moreover, the data stored in each memory bank may be independent of the memory stored in the other memory banks because no memory controllers are shared between the memory banks.

In some embodiments, as described below with respect to FIG. 8, the data stored in each of the plurality of memory banks may be defined by a compiler configured to distribute the data among the plurality of memory banks. Furthermore, the compiler may be configured to distribute data defined in a higher-level series of tasks among the plurality of memory banks using a plurality of lower-level tasks distributed among the corresponding processor subunits.
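One way to picture such a data-placement pass is as splitting a large array so that each memory bank holds only the slice its dedicated subunit needs, together with a table recording where each element ended up. The Python sketch below is a simplified illustration; the round-robin placement policy, the bank count, and the function names are assumptions, not the disclosed compiler behavior.

```python
def distribute(data, num_banks):
    """Place elements of `data` into per-bank lists and record the placement."""
    placement = {}                          # element index -> (bank, local offset)
    banks = [[] for _ in range(num_banks)]
    for idx, value in enumerate(data):
        bank = idx % num_banks              # simple round-robin placement
        placement[idx] = (bank, len(banks[bank]))
        banks[bank].append(value)
    return banks, placement

banks, placement = distribute(list(range(10)), num_banks=4)
print(banks)          # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(placement[6])   # (2, 1): element 6 lives in bank 2 at local offset 1
```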

As explained further below, the higher-level series of tasks may comprise a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise lower-level series of tasks, each of which comprises a set of instructions written in machine code.

As explained above with respect to FIG. 7A and FIG. 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks among the plurality of memory banks. Furthermore, as explained above, the data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular bus of the plurality of buses may be controlled by the corresponding processor subunit connected to that particular bus. Accordingly, one of the tasks included in the series of instructions may include outputting data to, or pulling data from, one of the buses. As explained above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunit may perform a computation, or use the buses connecting that processor subunit to other processor subunits, in the same cycle during which the accelerator pulls data from, or places data on, one of the buses connected to the one or more corresponding dedicated memory banks.

Accordingly, in one embodiment, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a sending task. The sending task may comprise a command for the processor subunit connected to the at least one of the plurality of buses to write data to the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receiving task. The receiving task may comprise a command for the processor subunit connected to the at least one of the plurality of buses to read data from the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Accordingly, the sending and receiving tasks in these embodiments may comprise control signals that are sent, along the at least one of the plurality of buses, to one or more memory controllers of the one or more corresponding dedicated memory banks. Furthermore, the sending and receiving tasks may be executed by one portion of the processing subunit (e.g., by one or more accelerators of the processing subunit) concurrently with a computation or other task executed by another portion of the processing subunit (e.g., by one or more different accelerators of the processing subunit). An example of such concurrent execution may include a MAC-relay command, in which receiving, multiplying, and sending are executed in tandem.
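The MAC-relay idea can be pictured as a three-stage pipeline in which, every cycle, one portion of the subunit receives a value from the bus, another multiplies and accumulates the value received the cycle before, and a third sends the previous product onward. The Python sketch below is a cycle-level toy model of that overlap; the three-stage split and the drain behavior are assumptions used only to illustrate the concurrency, not the disclosed micro-architecture.

```python
def mac_relay(incoming, weight):
    """Simulate receive / multiply-accumulate / send happening in tandem."""
    received = None        # value captured from the bus last cycle
    product = None         # product computed last cycle, ready to send
    acc = 0.0
    sent = []
    for value in incoming + [None, None]:       # two extra cycles to drain
        to_send = product                       # stage 3: send previous product
        if received is not None:
            product = received * weight         # stage 2: multiply and accumulate
            acc += product
        else:
            product = None
        received = value                        # stage 1: receive from the bus
        if to_send is not None:
            sent.append(to_send)
    return acc, sent

acc, sent = mac_relay([1.0, 2.0, 3.0], weight=0.5)
assert acc == 3.0 and sent == [0.5, 1.0, 1.5]
```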

In addition to distributing data among the memory banks, particular portions of the data may also be duplicated across different memory banks. For example, as explained above, a distributed processor on a memory chip may comprise a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of processor subunits may be connected to at least one dedicated memory bank among the plurality of memory banks, and each memory bank of the plurality of memory banks may be configured to store data independent of the data stored in the others of the plurality of memory banks. Moreover, at least some of the data stored in one particular memory bank of the plurality of memory banks may comprise a duplicate of data stored in at least another memory bank of the plurality of memory banks. For example, a number, string, or other type of data used in the series of instructions may be stored in a plurality of memory banks dedicated to different processor subunits, rather than being transferred from one memory bank to the other processor subunits in the memory chip.

In one embodiment, parallel string matching may use the data duplication described above. For example, a plurality of strings may be compared against the same string. A conventional processor may compare each string of the plurality of strings against the same string sequentially. On the hardware chip of the present disclosure, the same string may be duplicated across the memory banks, such that the processor subunits may compare separate strings of the plurality of strings against the duplicated string in parallel.
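A small model of this example is shown below. The Python sketch is illustrative only: thread workers stand in for the on-chip processor subunits, the per-bank dictionaries stand in for dedicated memory banks that each hold a duplicated copy of the pattern, and all names and data are invented.

```python
from concurrent.futures import ThreadPoolExecutor

candidates = ["memory", "process", "pattern", "patterns", "pattern "]
pattern = "pattern"
# The pattern is duplicated into every "bank" alongside that bank's candidate,
# so no inter-subunit transfer is needed during the comparison.
banks = [{"pattern": pattern, "candidate": c} for c in candidates]

def compare_local(bank):
    """Each subunit reads only its dedicated bank and reports a match flag."""
    return bank["candidate"] == bank["pattern"]

with ThreadPoolExecutor(max_workers=len(banks)) as pool:
    matches = list(pool.map(compare_local, banks))

print(matches)   # [False, False, True, False, False]
```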

In some embodiments, as described below with respect to FIG. 8, at least some of the data replicated across one particular memory bank of the plurality of memory banks and at least one other memory bank of the plurality of memory banks is defined by a compiler configured to replicate data across the memory banks. Furthermore, the compiler may be configured to replicate the at least some data using a plurality of lower-level tasks distributed among corresponding processor subunits.

Replication of data may be appropriate for certain tasks that reuse the same portions of data across different computations. By replicating these portions of the data, the different computations may be distributed among the processor subunits of the memory chip for parallel execution, while each processor subunit may store that portion of the data in, and access the stored portion from, a dedicated memory bank (rather than pushing and fetching that portion of the data across the buses connecting the processor subunits). In one embodiment, the at least some data replicated across one particular memory bank of the plurality of memory banks and at least one other memory bank of the plurality of memory banks may include weights of a neural network. In this embodiment, each node in the neural network may be defined by at least one processor subunit among the plurality of processor subunits. For example, each node may include machine code executed by the at least one processor subunit defining that node. In this embodiment, replication of the weights may allow each processor subunit to execute machine code to at least partially implement the corresponding node while accessing only one or more dedicated memory banks (rather than performing data transfers with other processor subunits). Because the timing of reads from and writes to the dedicated memory banks is independent of the other processor subunits, whereas the timing of data transfers between processor subunits requires timing synchronization (e.g., using software, as explained above), replicating memory to avoid data transfers between processor subunits may further increase the efficiency of overall execution.

As explained above with respect to FIGS. 7A and 7B, the memory chip may further include a plurality of buses, each bus connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks among the plurality of memory banks. Furthermore, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular bus of the plurality of buses may be controlled by the corresponding processor subunit connected to that particular bus. Therefore, one of the tasks included in the series of instructions may include outputting data to, or fetching data from, one of the buses. As explained above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. As further explained above, such tasks may include send tasks and/or receive tasks that include control signals sent along at least one of the plurality of buses to one or more memory controllers of the one or more corresponding dedicated memory banks.

FIG. 8 depicts a flowchart of a method 800 for compiling a series of instructions for execution on an exemplary memory chip of the present disclosure, e.g., as depicted in FIGS. 7A and 7B. Method 800 may be implemented by any conventional processor, whether general-purpose or special-purpose.

Method 800 may be executed as part of a computer program forming a compiler. As used herein, "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language such as C, FORTRAN, BASIC, or the like; an object-oriented language such as Java, C++, Pascal, Python, or the like; etc.) into a lower-level language (e.g., assembly code, object code, machine code, or the like). A compiler may allow a human to program a series of instructions in a human-readable language, which is then converted into a machine-executable language.

At step 810, the processor may assign tasks associated with the series of instructions to different ones of the processor subunits. For example, the series of instructions may be divided into subgroups to be executed in parallel across the processor subunits. In one embodiment, a neural network may be divided into its nodes, and one or more nodes may be assigned to separate processor subunits. In this embodiment, each subgroup may include multiple nodes connected across different layers. Thus, a processor subunit may implement a node from a first layer of the neural network, a node from a second layer connected to the node from the first layer implemented by the same processor subunit, and the like. By assigning nodes based on their connections, data transfers between the processor subunits may be reduced, which may lead to increased efficiency, as explained above.
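A minimal sketch of the connection-aware assignment idea of step 810 is shown below; the edge-list input, the load-balancing rule, and the function name partition_nodes are illustrative assumptions, since the disclosure does not prescribe a specific partitioning algorithm.

```python
def partition_nodes(edges, num_subunits):
    """Assign neural-network nodes to processor subunits so that connected
    nodes tend to share a subunit, reducing inter-subunit transfers.
    `edges` is a list of (src_node, dst_node) pairs across layers."""
    assignment = {}
    for src, dst in edges:
        if src not in assignment:
            # place an unseen source node on the least-loaded subunit
            loads = [list(assignment.values()).count(s) for s in range(num_subunits)]
            assignment[src] = loads.index(min(loads))
        # co-locate the destination with its source when it is not yet placed
        assignment.setdefault(dst, assignment[src])
    return assignment

edges = [("l1_a", "l2_a"), ("l1_b", "l2_b"), ("l2_a", "l3_a"), ("l2_b", "l3_a")]
print(partition_nodes(edges, num_subunits=2))
```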

As explained above and depicted in FIGS. 7A and 7B, the processor subunits may be spatially distributed among the plurality of memory banks disposed on the memory chip. Accordingly, the assignment of tasks may be, at least in part, a spatial division as well as a logical division.

At step 820, the processor may generate tasks to transfer data between pairs of processor subunits of the memory chip, each pair of processor subunits being connected by a bus. For example, as explained above, such data transfers may be controlled using software. Accordingly, the processor subunits may be configured to push data onto the buses and fetch data from the buses at synchronized times. The generated tasks may therefore include tasks for performing such synchronized pushing and fetching of data.

As explained above, step 820 may include preprocessing to account for the internal behavior of the processor subunits, including timing and latencies. For example, the processor may use known timings and latencies of the processor subunits (e.g., the time to push data onto a bus, the time to fetch data from a bus, the latency between a computation and a push or fetch, or the like) to ensure that the generated tasks are synchronized. Accordingly, a data transfer comprising at least one push by one or more processor subunits and at least one fetch by one or more processor subunits may occur simultaneously, without delays caused by timing differences between the processor subunits, latencies of the processor subunits, or the like.
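The timing preprocessing may be pictured with a small arithmetic sketch such as the one below; the latency figures and the function schedule_transfer are assumed for illustration and do not reflect actual hardware parameters.

```python
def schedule_transfer(compute_done_cycle, push_latency, fetch_latency):
    """Return the cycle at which the sender drives the bus and the cycle at
    which the receiver must issue its fetch so both meet on the bus together."""
    push_cycle = compute_done_cycle + push_latency       # data is on the bus here
    receiver_issue_cycle = push_cycle - fetch_latency    # start fetching early enough
    return push_cycle, receiver_issue_cycle

# sender finishes computing at cycle 100; latencies of 2 and 3 cycles are assumed
push_at, fetch_issue_at = schedule_transfer(100, push_latency=2, fetch_latency=3)
print(push_at, fetch_issue_at)  # 102 99 -> both sides meet on the bus at cycle 102
```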

At step 830, the processor may group the assigned and generated tasks into a plurality of groups of sub-series of instructions. For example, the sub-series of instructions may each include a series of tasks for execution by a single processor subunit. Accordingly, each of the plurality of groups of sub-series of instructions may correspond to a different one of the plurality of processor subunits. Steps 810, 820, and 830 may therefore result in dividing the series of instructions into the plurality of groups of sub-series of instructions. As explained above, step 820 may ensure that any data transfers between the different groups are synchronized.

At step 840, the processor may generate machine code corresponding to each of the plurality of groups of sub-series of instructions. For example, the higher-level code representing the sub-series of instructions may be converted into lower-level code, such as machine code, executable by the corresponding processor subunits.

At step 850, the processor may assign the generated machine code corresponding to each of the plurality of groups of sub-series of instructions to a corresponding one of the plurality of processor subunits in accordance with the division. For example, the processor may label each sub-series of instructions with an identifier of the corresponding processor subunit. Accordingly, when the sub-series of instructions are uploaded to the memory chip for execution (e.g., by host 350 of FIG. 3A), each sub-series may configure the correct processor subunit.
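Steps 830 through 850 may be summarized with the following toy sketch, in which tasks already assigned per subunit are lowered to placeholder "machine code" and tagged with the subunit identifier; the data layout and the codegen callback are assumptions of the sketch, not a description of an actual code generator.

```python
def emit_subseries(tasks_by_subunit, codegen):
    """Steps 830-850 in miniature: group each subunit's tasks into one
    sub-series, lower it to placeholder machine code, and tag it with the
    subunit ID so the host can load each blob into the right subunit."""
    images = {}
    for subunit_id, tasks in tasks_by_subunit.items():
        machine_code = [codegen(t) for t in tasks]              # step 840: lowering
        images[subunit_id] = {"id": subunit_id, "code": machine_code}  # step 850: label
    return images

tasks = {0: ["load a", "mac a,w0"], 1: ["load b", "mac b,w1"]}
print(emit_subseries(tasks, codegen=lambda t: f"<op:{t}>"))
```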

In some embodiments, assigning the tasks associated with the series of instructions to different ones of the processor subunits may depend, at least in part, on the spatial proximity between two or more of the processor subunits on the memory chip. For example, as explained above, efficiency may be increased by reducing the number of data transfers between processor subunits. Accordingly, the processor may minimize data transfers that move data across more than two of the processor subunits. Thus, the processor may use a known layout of the memory chip in combination with one or more optimization algorithms (such as a greedy algorithm) in order to assign the sub-series to the processor subunits in a manner that maximizes (at least locally) transfers between adjacent processor subunits and minimizes (at least locally) transfers to non-adjacent processor subunits.
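As one possible reading of this proximity-aware assignment, the sketch below greedily places the most heavily communicating sub-series pairs on adjacent positions of an assumed rectangular grid of subunits; the traffic weights, grid shape, and placement heuristic are illustrative only and are not prescribed by the disclosure.

```python
from itertools import product

def greedy_place(traffic, grid_w, grid_h):
    """Place sub-series on a grid of subunits, heaviest-talking pairs first,
    preferring free slots adjacent to an already-placed partner."""
    slots = {(x, y): None for x, y in product(range(grid_w), range(grid_h))}
    placed = {}

    def neighbors(pos):
        x, y = pos
        return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (x + dx, y + dy) in slots]

    def place(sub, near=None):
        if sub in placed:
            return
        candidates = [p for p in (neighbors(near) if near else slots) if slots[p] is None]
        pos = candidates[0] if candidates else next(p for p in slots if slots[p] is None)
        slots[pos], placed[sub] = sub, pos

    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        place(a)
        place(b, near=placed[a])
    return placed

traffic = {("s0", "s1"): 90, ("s1", "s2"): 40, ("s0", "s3"): 5}
print(greedy_place(traffic, grid_w=2, grid_h=2))
```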

Method 800 may include further optimizations for the memory chips of the present disclosure. For example, the processor may group data associated with the series of instructions based on the division and assign the data to the memory banks in accordance with the grouping. Accordingly, the memory banks may hold the data used by the sub-series of instructions assigned to each processor subunit to which each memory bank is dedicated.

In some embodiments, grouping the data may include determining at least a portion of the data to replicate in two or more of the memory banks. For example, as explained above, some data may be used across more than one sub-series of instructions. Such data may be replicated across the memory banks dedicated to the plurality of processor subunits assigned the different sub-series of instructions. This optimization may further reduce data transfers across the processor subunits.
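The replication decision may be illustrated as follows; the uses mapping and the helper plan_data_placement are assumptions of the sketch, which simply marks every data item read by more than one sub-series for duplication into each reader's dedicated bank.

```python
def plan_data_placement(uses):
    """`uses` maps a data item to the set of subunits whose sub-series read it.
    Items used by a single subunit live only in that subunit's bank; items
    used by several are replicated into every user's dedicated bank."""
    placement = {item: sorted(readers) for item, readers in uses.items()}
    replicated = {item for item, readers in uses.items() if len(readers) > 1}
    return placement, replicated

uses = {"weights_l1": {0, 1, 2}, "bias_l1": {0}, "lookup_tbl": {1, 3}}
placement, replicated = plan_data_placement(uses)
print(placement)           # each item -> the banks holding a copy
print(sorted(replicated))  # ['lookup_tbl', 'weights_l1'] are duplicated across banks
```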

The output of method 800 may be input to a memory chip of the present disclosure for execution. For example, a memory chip may include a plurality of processor subunits and a corresponding plurality of memory banks, each processor subunit being connected to at least one memory bank dedicated to that processor subunit, and the processor subunits of the memory chip may be configured to execute the machine code generated by method 800. As explained above with respect to FIG. 3A, host 350 may input the machine code generated by method 800 to the processor subunits for execution.

Sub-Banks and Sub-Controllers

In a conventional memory bank, the controller is provided at the bank level. Each bank includes a plurality of mats, which are typically arranged in a rectangular manner but may be arranged in any geometric shape. Each mat includes a plurality of memory cells, which are also typically arranged in a rectangular manner but may be arranged in any geometric shape. Each cell may store a single bit of data (e.g., depending on whether the cell is held at a high voltage or a low voltage).

Examples of this conventional architecture are depicted in FIGS. 9 and 10. As shown in FIG. 9, at the bank level, a plurality of mats (e.g., mats 930-1, 930-2, 940-1, and 940-2) may form bank 900. In a conventional rectangular organization, bank 900 may be controlled across global word lines (e.g., word line 950) and global bit lines (e.g., bit line 960). Accordingly, row decoder 910 may select the correct word line based on an incoming control signal (e.g., a request to read from an address, a request to write to an address, or the like), and global sense amplifier 920 (and/or a global column decoder, not shown in FIG. 9) may select the correct bit line based on the control signal. Amplifier 920 may also amplify any voltage levels from the selected bank during a read operation. Although depicted as using a row decoder for initial selection and performing amplification along columns, a bank may additionally or alternatively use a column decoder for initial selection and perform amplification along rows.

FIG. 10 depicts an example of a mat 1000. For example, mat 1000 may form part of a memory bank, such as bank 900 of FIG. 9. As depicted in FIG. 10, a plurality of cells (e.g., cells 1030-1, 1030-2, and 1030-3) may form mat 1000. Each cell may include a capacitor, a transistor, or other circuitry that stores at least one bit of data. For example, a cell may include a capacitor that is charged to represent a "1" and discharged to represent a "0", or a cell may include a flip-flop having a first state representing a "1" and a second state representing a "0". A conventional mat may include, for example, 512 bits by 512 bits. In embodiments where mat 1000 forms part of an MRAM, ReRAM, or the like, a cell may include a transistor, resistor, capacitor, or other mechanism for isolating an ion or a portion of a material that stores at least one bit of data. For example, a cell may include electrolyte ions, a portion of chalcogenide glass, or the like, having a first state representing a "1" and a second state representing a "0".

As further depicted in FIG. 10, in a conventional rectangular organization, mat 1000 may be controlled across local word lines (e.g., word line 1040) and local bit lines (e.g., bit line 1050). Accordingly, word line drivers (e.g., word line drivers 1020-1, 1020-2, ..., 1020-x) may control the selected word line to perform a read, write, or refresh based on a control signal (e.g., a request to read from an address, a request to write to an address, or a refresh signal) from the controller associated with the memory bank of which mat 1000 forms a part. Furthermore, local sense amplifiers (e.g., local amplifiers 1010-1, 1010-2, ..., 1010-x) and/or local column decoders (not shown in FIG. 10) may control the selected bit line to perform the read, write, or refresh. The local sense amplifiers may also amplify any voltage levels from the selected cells during a read operation. Although depicted as using word line drivers for initial selection and performing amplification along columns, a mat may alternatively use bit line drivers for initial selection and perform amplification along rows.

As explained above, a large number of mats are duplicated to form a memory bank. Memory banks may be grouped to form a memory chip. For example, a memory chip may include eight to thirty-two memory banks. Accordingly, pairing processor subunits with memory banks on a conventional memory chip may yield only eight to thirty-two processor subunits. Therefore, embodiments of the present disclosure may include memory chips with an additional sub-bank hierarchy. Such memory chips of the present disclosure may then include processor subunits with memory sub-banks used as the dedicated memory banks paired with the processor subunits, allowing a larger number of sub-processors, which may in turn achieve higher parallelism and performance for in-memory computing.

In some embodiments of the present disclosure, the global row decoder and global sense amplifier of bank 900 may be replaced with sub-bank controllers. Accordingly, rather than sending control signals to the global row decoder and global sense amplifier of the memory bank, the controller of the memory bank may direct the control signals to the appropriate sub-bank controller. The direction may be controlled dynamically or may be hard-wired (e.g., via one or more logic gates). In some embodiments, fuses may be used to indicate whether the controller of each sub-bank or mat blocks or passes the control signals to the appropriate sub-bank or mat. In such embodiments, fuses may thus be used to deactivate faulty sub-banks.

In one of these embodiments, a memory chip may include a plurality of memory banks, each memory bank having a bank controller and a plurality of memory sub-banks, each memory sub-bank having a sub-bank row decoder and a sub-bank column decoder to allow reads and writes to locations on the memory sub-bank. Each sub-bank may include a plurality of memory mats, each memory mat having a plurality of memory cells and possibly having internal local row decoders, column decoders, and/or local sense amplifiers. The sub-bank row decoders and sub-bank column decoders may process read and write requests from the bank controller or from a sub-bank processor subunit used for in-memory computations on the sub-bank memory, as described below. Additionally, each memory sub-bank may further have a controller configured to determine whether to process read and write requests from the bank controller and/or forward them to the next level (e.g., the next level of row and column decoders on a mat), or to block the requests, e.g., in order to allow an internal processing element or processor subunit to access the memory. In some embodiments, the bank controller may be synchronized to a system clock. However, the sub-bank controllers may not be synchronized to the system clock.
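A highly simplified software model of such a sub-bank controller is sketched below; the class name, the locally_owned flag, and the dictionary-based "mats" are assumptions used only to show the forward-or-block decision and are not part of the disclosed circuit.

```python
class SubBankController:
    """Toy model: forwards bank-level requests to the mat decoders unless the
    local processor subunit currently owns the sub-bank, in which case the
    request is blocked (reported here as an error)."""

    def __init__(self, mats):
        self.mats = mats                 # mat_id -> dict acting as cell storage
        self.locally_owned = False       # set True while the local subunit computes

    def handle(self, op, mat_id, addr, value=None):
        if self.locally_owned:
            return ("error", "sub-bank busy with local in-memory computation")
        mat = self.mats[mat_id]          # next decode level: the mat's row/column logic
        if op == "write":
            mat[addr] = value
            return ("ok", None)
        return ("ok", mat.get(addr))

ctrl = SubBankController(mats={0: {}, 1: {}})
print(ctrl.handle("write", 0, 0x10, 42))   # ('ok', None)
ctrl.locally_owned = True
print(ctrl.handle("read", 0, 0x10))        # ('error', ...) while locally owned
```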

As explained above, the use of sub-banks may allow a larger number of processor subunits to be included in the memory chip than if processor subunits were paired with the memory banks of a conventional chip. Accordingly, each sub-bank may further have a processor subunit that uses the sub-bank as its dedicated memory. As explained above, the processor subunit may include a RISC, CISC, or other general-purpose processing subunit and/or may include one or more accelerators. In addition, the processor subunit may include an address generator, as explained above. In any of the embodiments described above, each processor subunit may be configured to access its sub-bank using the row decoder and column decoder dedicated to that sub-bank, without using the bank controller. The processor subunit associated with a sub-bank may also handle the memory mats (including the decoders and the memory redundancy mechanisms described below) and/or determine whether read or write requests from an upper level (e.g., the bank level or the memory level) are forwarded and accordingly handled.

In some embodiments, the sub-bank controller may further include a register that stores a state of the sub-bank. Accordingly, if the sub-bank controller receives a control signal from the memory controller while the register indicates that the sub-bank is in use, the sub-bank controller may return an error. In embodiments where each sub-bank also includes a processor subunit, the register may indicate an error if the processor subunit in the sub-bank is accessing the memory in conflict with an external request from the memory controller.

FIG. 11 shows an example of another embodiment of a memory bank that uses sub-bank controllers. In the embodiment of FIG. 11, bank 1100 has a row decoder 1110, a column decoder 1120, and a plurality of memory sub-banks (e.g., sub-banks 1170a, 1170b, and 1170c) with sub-bank controllers (e.g., controllers 1130a, 1130b, and 1130c). The sub-bank controllers may include address resolvers (e.g., resolvers 1140a, 1140b, and 1140c), which may determine whether to pass a request to one or more sub-banks controlled by the sub-bank controller.

The sub-bank controllers may further include one or more logic circuits (e.g., logic 1150a, 1150b, and 1150c). For example, a logic circuit including one or more processing elements may allow one or more operations, such as refreshing cells in the sub-bank, clearing cells in the sub-bank, or the like, to be performed without a processing request from outside bank 1100. Alternatively, the logic circuit may include a processor subunit, as explained above, such that the processor subunit has any sub-banks controlled by the sub-bank controller as its corresponding dedicated memory. In the embodiment of FIG. 11, logic 1150a may have sub-bank 1170a as its corresponding dedicated memory, logic 1150b may have sub-bank 1170b as its corresponding dedicated memory, and logic 1150c may have sub-bank 1170c as its corresponding dedicated memory. In any of the embodiments described above, the logic circuits may have buses to their sub-banks, e.g., buses 1131a, 1131b, or 1131c. As further depicted in FIG. 11, the sub-bank controllers may each include a plurality of decoders, such as a sub-bank row decoder and a sub-bank column decoder, to allow a processing element or processor subunit, or a higher-level memory controller issuing commands, to read from and write to addresses on the memory sub-bank. For example, sub-bank controller 1130a includes decoders 1160a, 1160b, and 1160c; sub-bank controller 1130b includes decoders 1160d, 1160e, and 1160f; and sub-bank controller 1130c includes decoders 1160g, 1160h, and 1160i. Based on a request from bank row decoder 1110, the sub-bank controllers may select a word line using the decoders included in the sub-bank controllers. The described system may allow a sub-bank's processing element or processor subunit to access the memory without interrupting other banks or even other sub-banks, thereby allowing each sub-bank processor subunit to perform memory computations in parallel with the other sub-bank processor subunits.

Furthermore, each sub-bank may include a plurality of memory mats, each memory mat having a plurality of memory cells. For example, sub-bank 1170a includes mats 1190a-1, 1190a-2, ..., 1190a-x; sub-bank 1170b includes mats 1190b-1, 1190b-2, ..., 1190b-x; and sub-bank 1170c includes mats 1190c-1, 1190c-2, ..., 1190c-3. As further depicted in FIG. 11, each sub-bank may include at least one decoder. For example, sub-bank 1170a includes decoder 1180a, sub-bank 1170b includes decoder 1180b, and sub-bank 1170c includes decoder 1180c. Accordingly, bank column decoder 1120 may select a global bit line (e.g., bit line 1121a or 1121b) based on external requests, while the sub-bank selected by bank row decoder 1110 may use its column decoder to select a local bit line (e.g., bit line 1181a or 1181b) based on local requests from the logic circuit to which the sub-bank is dedicated. Accordingly, each processor subunit may be configured to access the sub-bank dedicated to that processor subunit using the sub-bank's row decoder and column decoder, without using the bank row decoder and the bank column decoder. Each processor subunit may thus access its corresponding sub-bank without interrupting the other sub-banks. Furthermore, when a request for a sub-bank originates outside of the processor subunit, the sub-bank decoders may reflect the accessed data to the bank decoders. Alternatively, in embodiments where each sub-bank has only a single row of memory mats, the local bit lines may be the bit lines of the mats rather than bit lines of the sub-bank.

A combination of the embodiment using sub-bank row decoders and sub-bank column decoders and the embodiment depicted in FIG. 11 may also be used. For example, the bank row decoder may be eliminated while the bank column decoder is retained and local bit lines are used.

FIG. 12 shows an example of an embodiment of a memory sub-bank 1200 having a plurality of mats. For example, sub-bank 1200 may represent a portion of sub-bank 1100 of FIG. 11 or may represent an alternative implementation of a memory bank. In the embodiment of FIG. 12, sub-bank 1200 includes a plurality of mats (e.g., mats 1240a and 1240b). Furthermore, each mat may include a plurality of cells. For example, mat 1240a includes cells 1260a-1, 1260a-2, ..., 1260a-x, and mat 1240b includes cells 1260b-1, 1260b-2, ..., 1260b-x.

Each mat may be assigned a range of addresses that will be assigned to the memory cells of the mat. These addresses may be configured at production such that mats may be shuffled around and such that faulty mats may be deactivated and left unused (e.g., using one or more fuses, as explained further below).

Sub-bank 1200 receives read and write requests from memory controller 1210. Although not depicted in FIG. 12, requests from memory controller 1210 may be filtered through a controller of sub-bank 1200 and directed to the appropriate mat of sub-bank 1200 for address resolution. Alternatively, at least a portion (e.g., the higher bits) of the address of a request from memory controller 1210 may be transmitted to all mats of sub-bank 1200 (e.g., mats 1240a and 1240b), such that each mat processes the full address and the request associated with that address only if the mat's assigned address range includes the address specified in the command. Similar to the sub-bank direction described above, the mat determination may be dynamically controlled or may be hard-wired. In some embodiments, fuses may be used to determine the address range of each mat, which also allows faulty mats to be disabled by assigning an illegal address range. Mats may additionally or alternatively be disabled by other common methods or connections of fuses.
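The address-range comparison can be modeled as in the following sketch, where each mat's fuse-configured (base, limit) pair decides whether it claims an incoming address and an empty range stands in for a fused-out mat; the class and field names are assumptions for illustration only.

```python
class Mat:
    """Toy model of a mat's comparator: the fuse-configured (base, limit) pair
    decides whether the mat responds to an incoming address; an empty range
    (base > limit) effectively deactivates a faulty mat."""

    def __init__(self, base, limit):
        self.base, self.limit = base, limit
        self.cells = {}

    def maybe_handle(self, addr, op, value=None):
        if not (self.base <= addr <= self.limit):
            return None                      # comparator rejects: not this mat's range
        if op == "write":
            self.cells[addr] = value
            return True
        return self.cells.get(addr)

mats = [Mat(0x000, 0x1FF), Mat(0x200, 0x3FF), Mat(1, 0)]  # last mat fused out
results = [m.maybe_handle(0x210, "write", 99) for m in mats]
print(results)  # [None, True, None] -> only the second mat claims the address
```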

In any of the embodiments described above, each mat of the sub-bank may include a row decoder (e.g., row decoder 1230a or 1230b) for selecting a word line in the mat. In some embodiments, each mat may further include fuses and comparators (e.g., 1220a and 1220b). As described above, the comparators may allow each mat to determine whether to process an incoming request, and the fuses may allow each mat to be deactivated in case of a fault. Alternatively, row decoders of the bank and/or sub-bank may be used rather than a row decoder in each mat.

Furthermore, in any of the embodiments described above, a column decoder included in the appropriate mat (e.g., column decoder 1250a or 1250b) may select a local bit line (e.g., bit line 1251 or 1253). The local bit lines may be connected to the global bit lines of the memory bank. In embodiments where the sub-bank has its own local bit lines, the local bit lines of the cells may further be connected to the local bit lines of the sub-bank. Accordingly, the data in the selected cells may be read through the column decoders (and/or sense amplifiers) of the cells, then through the column decoders (and/or sense amplifiers) of the sub-bank (in embodiments including sub-bank column decoders and/or sense amplifiers), and then through the column decoders (and/or sense amplifiers) of the bank.

Mat 1200 may be duplicated and arranged in an array to form a memory bank (or a memory sub-bank). For example, a memory chip of the present disclosure may include a plurality of memory banks, each memory bank having a plurality of memory sub-banks, and each memory sub-bank having a sub-bank controller for processing reads and writes to locations on the memory sub-bank. Furthermore, each memory sub-bank may include a plurality of memory mats, each memory mat having a plurality of memory cells and having a mat row decoder and a mat column decoder (e.g., as depicted in FIG. 12). The mat row decoders and mat column decoders may process read and write requests from the sub-bank controller. For example, the mat decoders may receive all requests and determine (e.g., using a comparator) whether to process a request based on a known address range of each mat, or the mat decoders may receive only requests within the known address range based on a selection of mats by the sub-bank (or bank) controller.

Controller Data Transfers

In addition to sharing data using processing subunits, any of the memory chips of the present disclosure may also share data using memory controllers (or sub-bank controllers or mat controllers). For example, a memory chip of the present disclosure may include: a plurality of memory banks (e.g., SRAM banks, DRAM banks, or the like), each memory bank having a bank controller, a row decoder, and a column decoder to allow reads and writes to locations on the memory bank; and a plurality of buses connecting each controller of the plurality of bank controllers to at least one other controller of the plurality of bank controllers. The plurality of buses may be similar to the buses connecting the processing subunits as described above, but the plurality of buses connect the bank controllers directly rather than through the processing subunits. Furthermore, although described as connecting the bank controllers, the buses may additionally or alternatively connect sub-bank controllers and/or mat controllers.

In some embodiments, the plurality of buses may be accessed without interrupting data transfers on the main buses of the memory banks connected to one or more processor subunits. Accordingly, a memory bank (or sub-bank) may transfer data to or from its corresponding processor subunit in the same clock cycle as it transfers data to or from a different memory bank (or sub-bank). In embodiments where each controller is connected to a plurality of other controllers, the controllers may be configurable to select another one of the other controllers for sending or receiving data. In some embodiments, each controller may be connected to at least one neighboring controller (e.g., pairs of spatially adjacent controllers may be connected to one another).

Redundant Logic in Memory Circuits

The present disclosure generally relates to a memory chip having primary logic portions for on-chip data processing. The memory chip may include redundant logic portions that may replace defective primary logic portions in order to increase the manufacturing yield of the chip. Accordingly, the chip may include on-chip components that allow the logic blocks in the memory chip to be configured based on individual testing of the logic portions. This feature of the chip may increase yield because memory chips with larger areas dedicated to logic portions are more prone to manufacturing failures. For example, a DRAM memory chip with a large redundant logic portion may be prone to manufacturing issues that reduce yield. Implementing redundant logic portions, however, may result in increased yield and reliability because the implementation enables a manufacturer or user of DRAM memory chips to turn full logic portions on or off while maintaining high parallelism. It should be noted that, here and throughout this disclosure, examples involving certain memory types (such as DRAM) may be identified in order to facilitate the explanation of the disclosed embodiments. It should be understood, however, that in these instances the identified memory types are not intended to be limiting. Rather, memory types such as DRAM, flash memory, SRAM, ReRAM, PRAM, MRAM, ROM, or any other memory may be used together with the disclosed embodiments, even if fewer examples are specifically identified in a certain section of the present disclosure.

FIG. 13 is a functional block diagram of an exemplary memory chip 1300 consistent with the disclosed embodiments. Memory chip 1300 may be implemented as a DRAM memory chip. Memory chip 1300 may also be implemented as any type of volatile or non-volatile memory, such as flash memory, SRAM, ReRAM, PRAM, and/or MRAM, etc. Memory chip 1300 may include a substrate 1301 in which are disposed an address manager 1302, a memory array 1304 including a plurality of memory banks 1304(a,a) to 1304(z,z), memory logic 1306, business logic 1308, and redundant business logic 1310. Memory logic 1306 and business logic 1308 may constitute primary logic blocks, while redundant business logic 1310 may constitute a redundant block. In addition, memory chip 1300 may include configuration switches, which may include a deactivation switch 1312 and an activation switch 1314. Deactivation switch 1312 and activation switch 1314 may also be disposed in substrate 1301. In this application, memory logic 1306, business logic 1308, and redundant business logic 1310 may also be collectively referred to as "logic blocks".

Address manager 1302 may include row and column decoders or other types of memory auxiliaries. Alternatively or additionally, address manager 1302 may include a microcontroller or a processing unit.

In some embodiments, as shown in FIG. 13, memory chip 1300 may include a single memory array 1304 that may arrange a plurality of memory blocks in a two-dimensional array on substrate 1301. In other embodiments, however, memory chip 1300 may include a plurality of memory arrays 1304, and each of the memory arrays 1304 may arrange its memory blocks in a different configuration. For example, the memory blocks (also referred to as memory banks) in at least one of the memory arrays may be arranged in a radial distribution to facilitate routing between address manager 1302 or memory logic 1306 and the memory blocks.

Business logic 1308 may be used to perform in-memory computations for applications unrelated to the logic used to manage the memory itself. For example, business logic 1308 may implement AI-related functions, such as floating-point, integer, or MAC operations used as activation functions. In addition, business logic 1308 may implement database-related functions, such as minimum, maximum, sort, count, and others. Memory logic 1306 may perform tasks related to memory management, including (but not limited to) read, write, and refresh operations. Accordingly, business logic may be added at one or more of the bank level, the mat level, or the mat-group level. Business logic 1308 may have one or more address outputs and one or more data inputs/outputs. For example, business logic 1308 may be addressed through row/column lines connecting to address manager 1302. In certain embodiments, however, the logic blocks may additionally or alternatively be addressed via the data inputs/outputs.
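As a loose illustration of such database-related functions, the sketch below lets each bank reduce its own rows to a small partial result that is then merged, so only the partials, rather than the raw rows, leave the banks; the list-of-lists data layout is purely an assumption of the example.

```python
def in_memory_aggregate(banks):
    """Each bank's business logic reduces its own rows to a tiny partial
    result (min, max, count); only these partials are merged afterwards,
    instead of streaming every row to a central processor."""
    partials = [(min(rows), max(rows), len(rows)) for rows in banks if rows]
    lo = min(p[0] for p in partials)
    hi = max(p[1] for p in partials)
    count = sum(p[2] for p in partials)
    return lo, hi, count

banks = [[7, 3, 9], [15, 2], [8, 8, 1]]
print(in_memory_aggregate(banks))  # (1, 15, 8)
```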

Redundant business logic 1310 may be a replica of business logic 1308. In addition, redundant business logic 1310 may be connected to deactivation switch 1312 and/or activation switch 1314, which may include small fuses/antifuses and which are used to logically disable or enable one of the instances (e.g., the instance connected by default) and enable one of the other logic blocks (e.g., the instance disconnected by default). In some embodiments, as further described with respect to FIG. 15, the redundancy of blocks may be local within a logic block, such as business logic 1308.

In some embodiments, the logic blocks in memory chip 1300 may be connected to subsets of memory array 1304 through dedicated buses. For example, a set comprising memory logic 1306, business logic 1308, and redundant business logic 1310 may be connected to the first row of memory blocks in memory array 1304 (i.e., memory blocks 1304(a,a) to 1304(a,z)). The dedicated buses may allow the associated logic blocks to quickly access data from the memory blocks without requiring communication lines to be opened through, for example, address manager 1302.

Each of the plurality of primary logic blocks may be connected to at least one of the plurality of memory banks 1304. In addition, redundant blocks, such as redundant business block 1310, may be connected to at least one of memory instances 1304(a,a) to 1304(z,z). A redundant block may replicate at least one of the plurality of primary logic blocks, such as memory logic 1306 or business logic 1308. Deactivation switch 1312 may be connected to at least one of the plurality of primary logic blocks, and activation switch 1314 may be connected to at least one of the plurality of redundant blocks.

In these embodiments, upon detection of a fault associated with one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308), deactivation switch 1312 may be configured to disable that one of the plurality of primary logic blocks. At the same time, activation switch 1314 may be configured to enable one of the plurality of redundant blocks, such as redundant logic block 1310, that replicates the one of the plurality of primary logic blocks.

In addition, activation switches 1314 and deactivation switches 1312, which may be collectively referred to as "configuration switches", may include an external input to configure the state of the switch. For example, activation switch 1314 may be configured such that an activation signal in the external input causes a closed-switch condition, while deactivation switch 1312 may be configured such that a deactivation signal in the external input causes an open-switch condition. In some embodiments, all configuration switches in memory chip 1300 may be deactivated by default and may become activated or enabled after testing indicates that the associated logic block is functional and a signal is applied in the external input. Alternatively, in some cases, all configuration switches in memory chip 1300 may be enabled by default and may be deactivated or disabled after testing indicates that the associated logic block is not functional and a deactivation signal is applied in the external input.

Regardless of whether a configuration switch is initially enabled or disabled, the configuration switch may disable its associated logic block upon detection of a fault associated with that logic block. In the case where the configuration switch is initially enabled, the state of the configuration switch may be changed to disabled in order to disable the associated logic block. In the case where the configuration switch is initially disabled, the state of the configuration switch may be kept in its disabled state in order to disable the associated logic block. For example, the result of an operability test may indicate that a certain logic block is non-operational or that the logic block fails to operate within certain specifications. In such cases, the logic block may be disabled, for example by not enabling its corresponding configuration switch.
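The post-test configuration described above may be pictured with the following sketch, in which failing primary blocks are left disabled and their redundant copies are enabled instead; the test_results and redundancy_map structures are assumptions made for the example and not part of the disclosed hardware.

```python
def configure_switches(test_results, redundancy_map):
    """Decide which logic blocks remain enabled after testing. Primary blocks
    that pass stay enabled; a failing primary is left disabled and, if its
    redundant copy passed testing, the copy's activation switch is closed."""
    spares = set(redundancy_map.values())
    enabled = set()
    for block, passed in test_results.items():
        if block in spares:
            continue                              # spares are handled via their primary
        if passed:
            enabled.add(block)
        elif test_results.get(redundancy_map.get(block), False):
            enabled.add(redundancy_map[block])    # swap in the redundant copy
    return enabled

tests = {"memory_1306": True, "business_1308": False, "redundant_1310": True}
print(sorted(configure_switches(tests, {"business_1308": "redundant_1310"})))
# ['memory_1306', 'redundant_1310'] -> the faulty business block is replaced by its copy
```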

In some embodiments, a configuration switch may be connected to two or more logic blocks and may be configured to select between the different logic blocks. For example, a configuration switch may be connected to both business logic block 1308 and redundant logic block 1310. The configuration switch may enable redundant logic block 1310 while disabling business logic 1308.

Alternatively or additionally, at least one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308) may be connected to a subset of the plurality of memory banks or memory instances 1304 with a first dedicated connection. Then, at least one redundant block of the plurality of redundant blocks that replicates the at least one of the plurality of primary logic blocks (such as redundant business logic 1310) may be connected to the same subset of the plurality of memory banks or instances 1304 with a second dedicated connection.

Furthermore, memory logic 1306 may have functions and capabilities different from those of business logic 1308. For example, while memory logic 1306 may be designed to enable read and write operations in memory bank 1304, business logic 1308 may be designed to perform in-memory computations. Accordingly, if business logic 1308 includes a first business logic block and also a second business logic block (such as redundant business logic 1310), it is possible to disconnect a defective business logic 1308 and reconnect redundant business logic 1310 without losing any capability.

In some embodiments, the configuration switches (including deactivation switch 1312 and activation switch 1314) may be implemented with fuses, antifuses, or programmable devices (including one-time programmable devices), or with other forms of non-volatile memory.

FIG. 14 is a functional block diagram of an exemplary set of redundant logic blocks 1400 consistent with the disclosed embodiments. In some embodiments, the set of redundant logic blocks 1400 may be disposed in substrate 1301. The set of redundant logic blocks 1400 may include at least one of business logic 1308 and redundant business logic 1310, connected to switches 1312 and 1314, respectively. In addition, business logic 1308 and redundant business logic 1310 may be connected to an address bus 1402 and a data bus 1404.

In some embodiments, as shown in FIG. 14, switches 1312 and 1314 may connect the logic blocks to a clock node. In this way, the configuration switches may engage or disengage the logic blocks from the clock signal, effectively activating or deactivating the logic blocks. In other embodiments, however, switches 1312 and 1314 may connect the logic blocks to other nodes for activation or deactivation. For example, the configuration switches may connect a logic block to a voltage supply node (e.g., VCC), to a ground node (e.g., GND), or to the clock signal. In this way, a logic block may be enabled or disabled by the configuration switch, because the configuration switch may create an open circuit or cut off the power supply to the logic block.

In some embodiments, as shown in FIG. 14, address bus 1402 and data bus 1404 may be on opposite sides of the logic blocks, which are connected in parallel to each of the buses. In this way, routing of the different on-chip components may be facilitated by the set of logic blocks 1400.

In some embodiments, each of the plurality of deactivation switches 1312 couples at least one of the plurality of primary logic blocks with a clock node, and each of the plurality of activation switches 1314 may couple at least one of the plurality of redundant blocks with the clock node, allowing the clock to be connected or disconnected as a simple activation/deactivation mechanism.

The redundant business logic 1310 of the set of redundant logic blocks 1400 allows a designer to choose, based on area and routing, the blocks that are worth duplicating. For example, a chip designer may choose larger blocks for duplication because larger blocks may be more prone to errors. Thus, the chip designer may decide to duplicate large logic blocks. On the other hand, a designer may prefer to duplicate smaller logic blocks because they are easy to duplicate without a significant loss of space. Furthermore, using the configuration in FIG. 14, a designer may easily choose the logic blocks to duplicate depending on the statistics of errors for each area.

FIG. 15 is a functional block diagram of an exemplary logic block 1500 consistent with the disclosed embodiments. The logic block may be business logic 1308 and/or redundant business logic 1310. In other embodiments, however, the exemplary logic block may describe memory logic 1306 or other components of memory chip 1300.

Logic block 1500 presents yet another embodiment that uses logic redundancy within a small processor pipeline. Logic block 1500 may include a register 1508, a fetch circuit 1504, a decoder 1506, and a write-back circuit 1518. In addition, logic block 1500 may include a computation unit 1510 and a duplicated computation unit 1512. In other embodiments, however, logic block 1500 may include other units that do not comprise a controller pipeline but include dispersed processing elements comprising the required business logic.

Computation unit 1510 and duplicated computation unit 1512 may include digital circuits capable of performing digital calculations. For example, computation unit 1510 and duplicated computation unit 1512 may include an arithmetic logic unit (ALU) to perform arithmetic and bitwise operations on binary numbers. Alternatively, computation unit 1510 and duplicated computation unit 1512 may include a floating-point unit (FPU) that operates on floating-point numbers. In addition, in some embodiments, computation unit 1510 and duplicated computation unit 1512 may implement database-related functions, such as minimum, maximum, count, and compare operations, among others.

In some embodiments, as shown in FIG. 15, computation unit 1510 and duplicated computation unit 1512 may be connected to switching circuits 1514 and 1516. When activated, the switching circuits may enable or disable the computation units.

In logic block 1500, duplicated computation unit 1512 may replicate computation unit 1510. Furthermore, in some embodiments, register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518 (collectively referred to as the local logic units) may be smaller in size than computation unit 1510. Because larger elements are more prone to issues during manufacturing, a designer may decide to duplicate larger units (such as computation unit 1510) rather than smaller units (such as the local logic units). Depending on historical yields and error rates, however, the designer may also choose to duplicate the local logic units in addition to, or instead of, duplicating a large unit (or an entire block). For example, computation unit 1510 may be larger than register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518, and therefore more error-prone. The designer may choose to duplicate computation unit 1510 rather than the other elements in logic block 1500 or the entire block.

Logic block 1500 may include a plurality of local configuration switches, each of which is connected to at least one of computation unit 1510 or duplicated computation unit 1512. When a fault in computation unit 1510 is detected, the local configuration switches may be configured to disable computation unit 1510 and enable duplicated computation unit 1512.
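A rough behavioral sketch of this switching behavior follows: work is steered to the duplicated computation unit once a fault is reported for the primary unit. The class names and the fault-reporting call are hypothetical; the actual switches are hardware elements, not software.

```python
class LogicBlockModel:
    """Behavioral model of logic block 1500 with a primary and a
    duplicated computation unit gated by configuration switches."""

    def __init__(self, primary, duplicate):
        self.units = {"primary": primary, "duplicate": duplicate}
        # Switch state: True = unit enabled. Primary enabled by default.
        self.enabled = {"primary": True, "duplicate": False}

    def report_fault(self, unit_name):
        # Emulates toggling switching circuits 1514/1516: the faulty
        # unit is disabled and its duplicate is enabled instead.
        if unit_name == "primary":
            self.enabled["primary"] = False
            self.enabled["duplicate"] = True

    def execute(self, op, *args):
        for name in ("primary", "duplicate"):
            if self.enabled[name]:
                return getattr(self.units[name], op)(*args)
        raise RuntimeError("no computation unit enabled")


class _Alu:
    def add(self, a, b):
        return a + b


block = LogicBlockModel(primary=_Alu(), duplicate=_Alu())
assert block.execute("add", 2, 3) == 5
block.report_fault("primary")           # fault detected in unit 1510
assert block.execute("add", 2, 3) == 5  # now served by duplicate 1512
```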

FIG. 16 shows a functional block diagram of exemplary logic blocks connected to a bus, consistent with the disclosed embodiments. In some embodiments, logic blocks 1602 (which may represent memory logic 1306, business logic 1308, or redundant business logic 1310) may be independent of one another, may be connected via a bus, and may be activated externally by addressing them specifically. For example, memory chip 1300 may include many logic blocks, each having an ID number. In other embodiments, however, logic blocks 1602 may represent a larger unit composed of several (one or more) of memory logic 1306, business logic 1308, or redundant business logic 1310.

In some embodiments, each of logic blocks 1602 may be redundant with the other logic blocks 1602. This full redundancy, in which all blocks may operate as either primary or redundant blocks, may improve manufacturing yield, because a designer may disconnect faulty units while maintaining the functionality of the overall chip. For example, a designer may be able to disable logic areas that are prone to errors while maintaining similar computing capability, because all duplicate blocks may be connected to the same address and data buses. For example, the initial number of logic blocks 1602 may be greater than a target capacity; disabling some logic blocks 1602 would then not affect the target capacity.

The bus connected to the logic blocks may include an address bus 1614, command lines 1616, and data lines 1618. As shown in FIG. 16, each of the logic blocks may be connected independently to each line of the bus. In certain embodiments, however, logic blocks 1602 may be connected in a hierarchical structure to facilitate routing. For example, each line of the bus may be connected to a multiplexer that routes the line to different logic blocks 1602.

In some embodiments, to allow external access without knowledge of the internal chip structure (which may change as units are enabled and disabled), each of the logic blocks may include a fused ID, such as fused identification 1604. Fused identification 1604 may include an array of switches (such as fuses) that determine the ID and may be connected to a management circuit. For example, fused identification 1604 may be connected to address manager 1302. Alternatively, fused identification 1604 may be connected to a higher memory address unit. In these embodiments, fused identification 1604 may be configurable for a specific address. For example, fused identification 1604 may include a programmable, non-volatile device that determines the final ID based on instructions received from the management circuit.

A distributed processor on a memory chip may be designed with the configuration depicted in FIG. 16. A test program, executed as a BIST at chip wake-up or during factory testing, may assign running ID numbers to those blocks of the plurality of primary logic blocks (memory logic 1306 and business logic 1308) that pass the test protocol. The test program may also assign illegal ID numbers to blocks of the plurality of primary logic blocks that fail the test protocol. The test program may likewise assign running ID numbers to blocks of the plurality of redundant blocks (redundant business logic 1310) that pass the test protocol. Because redundant blocks replace failed primary logic blocks, the number of redundant blocks assigned running ID numbers may be equal to or greater than the number of primary logic blocks assigned illegal ID numbers, thereby disabling the failed blocks. In addition, each of the plurality of primary logic blocks and each of the plurality of redundant blocks may include at least one fused identification 1604. Also, as shown in FIG. 16, the bus connecting logic blocks 1602 may include command lines, data lines, and address lines.
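The ID-assignment policy described above can be summarized with a short sketch. The following Python routine is a hypothetical model of what such a BIST-driven test program might compute; the value of `ILLEGAL_ID`, the block lists, and the `passes_test` callback are assumptions used only for illustration.

```python
ILLEGAL_ID = 0xFFF  # illegal address that keeps a block disabled


def assign_block_ids(primary_blocks, redundant_blocks, passes_test):
    """Assigns running IDs to blocks that pass the test protocol and
    illegal IDs to primary blocks that fail, mirroring the BIST flow."""
    ids = {}
    next_id = 0x001
    failed_primaries = 0

    for block in primary_blocks:
        if passes_test(block):
            ids[block] = next_id
            next_id += 1
        else:
            ids[block] = ILLEGAL_ID  # block stays disabled
            failed_primaries += 1

    # Enable as many passing redundant blocks as there were failures.
    enabled_redundant = 0
    for block in redundant_blocks:
        if enabled_redundant >= failed_primaries:
            break
        if passes_test(block):
            ids[block] = next_id
            next_id += 1
            enabled_redundant += 1
        else:
            ids[block] = ILLEGAL_ID
    return ids


ids = assign_block_ids(
    ["mem_logic", "biz_0", "biz_1"], ["red_0", "red_1"],
    passes_test=lambda b: b != "biz_1")
# biz_1 keeps 0xFFF; red_0 receives the next running ID in its place.
```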

In other embodiments, however, all logic blocks 1602 connected to the bus may start out disabled and without an ID number. Tested one by one, each good logic block receives a running ID number, while non-working logic blocks retain an illegal ID, which disables those blocks. In this way, redundant logic blocks may improve manufacturing yield by replacing blocks found to be defective during the test process.

Address bus 1614 may couple the management circuit to each of the plurality of memory banks, each of the plurality of primary logic blocks, and each of the plurality of redundant blocks. These connections allow the management circuit, upon detecting a fault associated with a primary logic block (such as business logic 1308), to assign an invalid address to one of the plurality of primary logic blocks and a valid address to one of the plurality of redundant blocks.

For example, as shown in FIG. 16A, an illegal ID is initially configured for all logic blocks 1602(a) to 1602(c) (e.g., address 0xFFF). After testing, logic blocks 1602(a) and 1602(c) are verified as functional, while logic block 1602(b) is not functional. In FIG. 16A, unshaded logic blocks may represent logic blocks that successfully passed the functionality test, while shaded logic blocks may represent logic blocks that failed it. The test program therefore changes the illegal ID to a legal ID for functional logic blocks and retains the illegal ID for non-functional logic blocks. As an example, in FIG. 16A, the addresses of logic blocks 1602(a) and 1602(c) are changed from 0xFFF to 0x001 and 0x002, respectively. In contrast, the address of logic block 1602(b) remains the illegal address 0xFFF. In some embodiments, the ID is changed by programming the corresponding fused identification 1604.

Different results from the testing of logic blocks 1602 may produce different configurations. For example, as shown in FIG. 16B, address manager 1302 may initially assign an illegal ID to all logic blocks 1602 (i.e., 0xFFF). The test results may, however, indicate that the two logic blocks 1602(a) and 1602(b) are functional. In that case, testing of logic block 1602(c) may not be necessary, because memory chip 1300 may require only two logic blocks. Therefore, to minimize test resources, logic blocks may be tested only up to the minimum number of functional logic blocks required by the product definition of memory chip 1300, leaving other logic blocks untested. FIG. 16B also shows unshaded logic blocks, representing tested logic blocks that passed the functionality test, and shaded logic blocks, representing untested logic blocks.

In these embodiments, a production tester (external or internal, automatic or manual) or a controller executing a BIST at startup may change the illegal ID to a running ID for functional tested logic blocks, while retaining the illegal ID for untested logic blocks. As an example, in FIG. 16B, the addresses of logic blocks 1602(a) and 1602(b) are changed from 0xFFF to 0x001 and 0x002, respectively. In contrast, the address of untested logic block 1602(c) remains the illegal address 0xFFF.

FIG. 17 is a functional block diagram of exemplary units 1702 and 1712 connected in series, consistent with the disclosed embodiments. FIG. 17 may represent an entire system or chip. Alternatively, FIG. 17 may represent a block within a chip that contains other functional blocks.

Units 1702 and 1712 may represent complete units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In these embodiments, units 1702 and 1712 may also include the elements required to perform operations, such as address manager 1302. In other embodiments, however, units 1702 and 1712 may represent logic units such as business logic 1308 or redundant business logic 1310.

FIG. 17 presents an embodiment in which units 1702 and 1712 may need to communicate among themselves. In such situations, units 1702 and 1712 may be connected in series. A non-working unit, however, may break the continuity between the logic blocks. Therefore, when a unit needs to be disabled because of a defect, the connection between units may include a bypass option. The bypass option may also be part of the bypassed unit itself.

In FIG. 17, units may be connected in series (e.g., 1702(a) to 1702(c)), and a failed unit (e.g., 1702(b)) may be bypassed when it is defective. The units may further be connected in parallel with switching circuits. For example, in some embodiments, units 1702 and 1712 may be connected with switching circuits 1722 and 1728, as depicted in FIG. 17. In the embodiment depicted in FIG. 17, unit 1702(b) is defective; for example, unit 1702(b) fails the circuit functionality test. Accordingly, unit 1702(b) may be disabled using, for example, activation switch 1314 (not shown in FIG. 17), and/or switching circuit 1722(b) may be activated to bypass unit 1702(b) and maintain connectivity between the logic blocks.

Therefore, when a plurality of primary units are connected in series, each of the plurality of units may be connected in parallel with a parallel switch. Upon detecting a fault associated with one of the plurality of units, the parallel switch connected to that unit may be activated to connect two of the plurality of units, bypassing the faulty unit, as shown in the sketch below.
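The following minimal sketch illustrates the idea of a serial chain with parallel bypass switches: a value propagates through the chain and is routed around any unit whose bypass switch has been activated. The chain representation and function names are illustrative assumptions, not the disclosed circuit.

```python
def run_chain(value, units, bypassed):
    """Passes `value` through serially connected units; units whose
    parallel switch is activated (listed in `bypassed`) are skipped,
    so the chain stays connected around a defective unit."""
    for index, unit in enumerate(units):
        if index in bypassed:
            continue  # parallel switch shorts this stage
        value = unit(value)
    return value


# Example: unit 1 is defective, so its parallel switch is activated.
units = [lambda x: x + 1, lambda x: x * 100, lambda x: x + 1]
assert run_chain(5, units, bypassed={1}) == 7
```

In hardware, the bypass path could additionally insert a sampling stage (such as sampling circuit 1730, discussed next) so that the latency of the skipped stage is preserved and the lines stay synchronized.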

In other embodiments, as shown in FIG. 17, switching circuit 1728 may include one or more sampling points that introduce one or more cycles of delay, to maintain synchronization between the different lines of units. When a unit is disabled, short-circuiting the connection between adjacent logic blocks may create synchronization errors with other calculations. For example, if a task requires data from both line A and line B, and each of A and B is carried by an independent series of units, then disabling a unit would cause desynchronization between the lines, which would require further data management. To prevent desynchronization, sampling circuit 1730 may emulate the delay caused by the disabled unit 1712(b). In some embodiments, however, the parallel switch may include an anti-fuse instead of sampling circuit 1730.

FIG. 18 is a functional block diagram of exemplary units connected in a two-dimensional array, consistent with the disclosed embodiments. FIG. 18 may represent an entire system or chip. Alternatively, FIG. 18 may represent a block within a chip that contains other functional blocks.

Units 1806 may represent autonomous units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1806 may represent logic units such as business logic 1308. Where convenient, the discussion of FIG. 18 may refer to elements identified in FIG. 13 (e.g., memory chip 1300) and discussed above.

As shown in FIG. 18, the units may be arranged in a two-dimensional array in which units 1806 (which may include or represent one or more of memory logic 1306, business logic 1308, or redundant business logic 1310) are interconnected via switch boxes 1808 and connection boxes 1810. In addition, to control the configuration of the two-dimensional array, the array may include I/O blocks 1804 at its periphery.

Connection boxes 1810 may be programmable and reconfigurable devices that respond to signals input from I/O blocks 1804. For example, a connection box may include a plurality of input pins from units 1806 and may also be connected to a switch box 1808. Alternatively, connection boxes 1810 may include a group of switches that connect the pins of programmable logic cells to routing tracks, while switch boxes 1808 may include a group of switches that connect different tracks.

In certain embodiments, connection boxes 1810 and switch boxes 1808 may be implemented with configuration switches such as switches 1312 and 1314. In these embodiments, connection boxes 1810 and switch boxes 1808 may be configured by a production tester or by a BIST executed at chip startup.

In some embodiments, connection boxes 1810 and switch boxes 1808 may be configured after the circuit functionality of units 1806 has been tested. In these embodiments, I/O blocks 1804 may be used to send test signals to units 1806. Depending on the test results, I/O blocks 1804 may send programming signals that configure connection boxes 1810 and switch boxes 1808 in a manner that disables the units 1806 that fail the test protocol and enables the units 1806 that pass it.

In these embodiments, the plurality of primary logic blocks and the plurality of redundant blocks may be disposed on the substrate in a two-dimensional grid. Accordingly, each of the plurality of primary units 1806 and each of the plurality of redundant blocks (such as redundant business logic 1310) may be interconnected with switch boxes 1808, and input blocks may be disposed at the periphery of each row and each column of the two-dimensional grid.

FIG. 19 is a functional block diagram of exemplary units in a complex connection, consistent with the disclosed embodiments. FIG. 19 may represent an entire system. Alternatively, FIG. 19 may represent a block within a chip that contains other functional blocks.

The complex connection of FIG. 19 includes units 1902(a) to 1902(f) and configuration switches 1904(a) to 1904(f). Units 1902 may represent autonomous units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1902 may represent logic units such as memory logic 1306, business logic 1308, or redundant business logic 1310. Configuration switches 1904 may include any of deactivation switches 1312 and activation switches 1314.

As shown in FIG. 19, the complex connection may include units 1902 in two planes. For example, the complex connection may include two independent substrates separated along the z-axis. Alternatively or additionally, units 1902 may be arranged on two surfaces of a substrate. For example, for the purpose of reducing the area of memory chip 1300, substrate 1301 may be arranged in two overlapping surfaces and connected with configuration switches 1904 arranged in three dimensions. The configuration switches may include deactivation switches 1312 and/or activation switches 1314.

A first plane of the substrate may include the "main" units 1902. These blocks may be enabled by default. In these embodiments, a second plane may include the "redundant" units 1902. These units may be disabled by default.

In some embodiments, configuration switches 1904 may include anti-fuses. Thus, after testing units 1902, a block may be connected into the set of functional units by switching certain anti-fuses to "always on" and disabling the selected unit 1902, even if that unit is in a different plane. In the embodiment presented in FIG. 19, one of the "main" units (unit 1902(e)) is not working. FIG. 19 may represent non-functional or untested blocks as shaded blocks, while tested or functional blocks may be unshaded. Accordingly, configuration switches 1904 are configured so that one of the logic blocks in a different plane (e.g., unit 1902(f)) becomes active. In this way, even if one of the main logic blocks is defective, the memory chip still works by substituting a spare logic unit.

FIG. 19 additionally shows that one of the units 1902 in the second plane (i.e., 1902(c)) is neither tested nor enabled, because its main logic block is functional. For example, in FIG. 19, the two main units 1902(a) and 1902(d) pass the functionality test; therefore, unit 1902(c) is neither tested nor enabled. FIG. 19 thus shows the ability to specifically select which logic blocks become active, depending on the test results.

In some embodiments, as shown in FIG. 19, not all units 1902 in the first plane may have a corresponding spare or redundant block. In other embodiments, however, all units may be redundant with one another to achieve full redundancy, in which all units are either primary or redundant. Moreover, although some implementations may follow the star network topology depicted in FIG. 19, other implementations may use parallel connections, serial connections, and/or couple the different elements to the configuration switches in parallel or in series.

FIG. 20 is an exemplary flowchart illustrating a redundant block enablement process 2000 consistent with the disclosed embodiments. Enablement process 2000 may be implemented for memory chip 1300 and, in particular, for DRAM memory chips. In some embodiments, process 2000 may include the following steps: testing at least one circuit functionality of each of a plurality of logic blocks on the substrate of the memory chip; identifying a faulty logic block among the plurality of primary logic blocks based on the test results; testing at least one circuit functionality of at least one redundant or additional logic block on the substrate of the memory chip; disabling the at least one faulty logic block by applying an external signal to a deactivation switch; and enabling the at least one redundant block by applying the external signal to an activation switch, the activation switch being connected to the at least one redundant block and disposed on the substrate of the memory chip. The description of FIG. 20 below further details each step of process 2000.

Process 2000 may include testing a plurality of logic blocks, such as business blocks 1308 (step 2002), and a plurality of redundant blocks (e.g., redundant business blocks 1310). The testing may be performed before packaging, using, for example, a probing station for on-wafer testing. The testing may, however, also be performed after packaging.

The testing in step 2002 may include applying a finite sequence of test signals to every logic block in memory chip 1300, or to a subset of the logic blocks in memory chip 1300. The test signals may include requesting a calculation whose expected result is 0 or 1. In other embodiments, the test signals may request reading a specific address in a memory bank or writing to a specific memory bank.

Testing techniques may be implemented in step 2002 to test the response of the logic blocks under an iterative process. For example, the test may involve testing a logic block by transmitting instructions to write data into a memory bank and then verifying the integrity of the written data. In some embodiments, the test may include repeating the algorithm with inverted data.

In alternative embodiments, the testing of step 2002 may include running a model of the logic block to produce a target memory image based on a set of test instructions. The same sequence of instructions may then be executed by the logic block in the memory chip, and the results may be recorded. The residual memory image from the simulation may also be compared with the image obtained from the self-test, and any mismatch may be flagged as a fault.
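One way to picture this model-based test is sketched below: the same instruction sequence is applied to a software model of the logic block and to the device under test, and the resulting memory images are compared word by word. The function names, the dictionary-based memory image, and the example addresses are hypothetical placeholders for whatever simulator and hardware interface are actually used.

```python
def image_compare_test(test_instructions, run_on_model, run_on_device):
    """Runs the same instruction sequence on a golden model and on the
    logic block under test, then flags any mismatching memory words."""
    target_image = run_on_model(test_instructions)   # simulated result
    device_image = run_on_device(test_instructions)  # recorded result

    mismatches = [
        (addr, expected, device_image.get(addr))
        for addr, expected in target_image.items()
        if device_image.get(addr) != expected
    ]
    return {"passed": not mismatches, "faults": mismatches}


# Toy example: the device under test corrupts address 0x10.
model = lambda prog: {0x10: 0xAA, 0x11: 0xBB}
device = lambda prog: {0x10: 0xA8, 0x11: 0xBB}
result = image_compare_test(["WRITE 0x10", "WRITE 0x11"], model, device)
assert result["passed"] is False and result["faults"][0][0] == 0x10
```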

Alternatively, in step 2002, the testing may include shadow modeling, in which a diagnosis is produced but the results are not necessarily predicted in advance. Instead, a test using shadow modeling may be executed in parallel on both the memory chip and a simulation. For example, when a logic block in the memory chip completes an instruction or task, the simulation may be signaled to execute the same instruction. Once the logic block in the memory chip completes the instruction, the architectural states of the two models may be compared. If there is a mismatch, a fault is flagged.

In some embodiments, all logic blocks (including, for example, each of memory logic 1306, business logic 1308, and redundant business logic 1310) may be tested in step 2002. In other embodiments, however, only subsets of the logic blocks may be tested in different test rounds. For example, in a first test round, only memory logic 1306 and its associated blocks may be tested. In a second round, only business logic 1308 and its associated blocks may be tested. In a third round, depending on the results of the first two rounds, the logic blocks associated with redundant business logic 1310 may be tested.

Process 2000 may continue to step 2004. In step 2004, faulty logic blocks may be identified, and faulty redundant blocks may also be identified. For example, logic blocks that fail the testing of step 2002 may be identified as faulty blocks in step 2004. In other embodiments, however, only certain faulty logic blocks may be identified initially. For example, in some embodiments, only the logic blocks associated with business logic 1308 may be identified, and faulty redundant blocks may be identified only when a redundant block is needed to replace a faulty logic block. In addition, identifying faulty blocks may include writing identification information of the identified faulty blocks to a memory bank or a non-volatile memory.

In step 2006, the faulty logic blocks may be disabled. For example, using a configuration circuit, a faulty logic block may be disabled by disconnecting it from the clock, ground, and/or power nodes. Alternatively, a faulty logic block may be disabled by configuring the connection boxes in an arrangement that avoids the logic block. In addition, in other embodiments, a faulty logic block may be disabled by receiving an illegal address from address manager 1302.

In step 2008, redundant blocks that duplicate the faulty logic blocks may be identified. To support the same capabilities of the memory chip even though some logic blocks have failed, step 2008 may identify redundant blocks that are available and that duplicate the faulty logic blocks. For example, if a logic block that performs vector multiplications is determined to be faulty, then in step 2008, address manager 1302 or an on-chip controller may identify an available redundant logic block that also performs vector multiplications.

In step 2010, the redundant blocks identified in step 2008 may be enabled. In contrast to the disabling operation of step 2006, in step 2010 the identified redundant blocks may be enabled by connecting them to the clock, ground, and/or power nodes. Alternatively, the identified redundant blocks may be enabled by configuring the connection boxes in an arrangement that connects them. In addition, in other embodiments, the identified redundant blocks may be enabled by receiving a running address at test program run time.

FIG. 21 is an exemplary flowchart illustrating an address assignment process 2100 consistent with the disclosed embodiments. Address assignment process 2100 may be implemented for memory chip 1300 and, in particular, for DRAM memory chips. As described with respect to FIG. 16, in some embodiments, the logic blocks in memory chip 1300 may be connected to a data bus and have an address identification. Process 2100 describes an address assignment method that disables faulty logic blocks and enables logic blocks that pass the test. The steps of process 2100 are described as being performed by a production tester or by a BIST executed at chip startup; however, other components of memory chip 1300 and/or external devices may also perform one or more steps of process 2100.

In step 2102, the tester may disable all logic blocks and redundant blocks by assigning an illegal identification to every logic block at the chip level.

In step 2104, the tester may execute a test protocol for a logic block. For example, the tester may run the test methods described in step 2002 for one or more of the logic blocks in memory chip 1300.

In step 2106, depending on the results of the test in step 2104, the tester may determine whether the logic block is defective. If the logic block is not defective (step 2106: no), the address manager may assign a running ID to the tested logic block in step 2108. If the logic block is defective (step 2106: yes), address manager 1302 may retain the illegal ID for the defective logic block in step 2110.

In step 2112, address manager 1302 may select a redundant logic block that duplicates the defective logic block. In some embodiments, the redundant logic block duplicating the defective logic block may have the same components and connections as the defective logic block. In other embodiments, however, the redundant logic block may have components and/or connections that differ from those of the defective logic block while still being capable of performing equivalent operations. For example, if the defective logic block is designed to perform vector multiplications, the selected redundant logic block would also be capable of performing vector multiplications, even if it does not have the same architecture as the defective unit.

In step 2114, address manager 1302 may test the redundant block. For example, the tester may apply the testing techniques applied in step 2104 to the identified redundant block.

In step 2116, based on the results of the test in step 2114, the tester may determine whether the redundant block is defective. In step 2118, if the redundant block is not defective (step 2116: no), the tester may assign a running ID to the identified redundant block. In some embodiments, process 2100 may return to step 2104 after step 2118, creating an iterative loop that tests all the logic blocks in the memory chip.

If the tester determines that the redundant block is defective (step 2116: yes), then in step 2120 the tester may determine whether additional redundant blocks are available. For example, the tester may query a memory bank for information about available redundant logic blocks. If a redundant logic block is available (step 2120: yes), the tester may return to step 2112 and identify a new redundant logic block that duplicates the defective logic block. If no redundant logic block is available (step 2120: no), then in step 2122 the tester may generate an error signal. The error signal may include information about the defective logic block and the defective redundant block.
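Putting steps 2102 through 2122 together, the address assignment flow might be modeled roughly as follows. This is a sketch only; the tester interface, the ID values, and the way redundant blocks are looked up are assumptions made for illustration.

```python
ILLEGAL_ID = 0xFFF


def address_assignment(logic_blocks, redundant_pool, test):
    """Rough model of process 2100: every block starts with an illegal
    ID, passing blocks receive running IDs, and defective blocks are
    replaced by tested redundant blocks while any remain available."""
    ids = {b: ILLEGAL_ID for b in logic_blocks + redundant_pool}  # step 2102
    next_id, errors = 0x001, []

    for block in logic_blocks:
        if test(block):                        # steps 2104-2108
            ids[block] = next_id
            next_id += 1
            continue
        replaced = False                       # step 2110: keep illegal ID
        while redundant_pool:                  # steps 2112, 2120
            candidate = redundant_pool.pop(0)
            if test(candidate):                # steps 2114-2118
                ids[candidate] = next_id
                next_id += 1
                replaced = True
                break
        if not replaced:                       # step 2122: error signal
            errors.append(block)
    return ids, errors
```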

Coupled Memory Banks

Embodiments of the present disclosure also include a distributed high-performance processor. The processor may include a memory controller that interfaces memory banks and processing units. The processor may be configurable to expedite the delivery of data to the processing units for calculations. For example, if a processing unit requires two data instances to perform a task, the memory controller may be configured so that communication lines independently provide access to the information from the two data instances. The disclosed memory architecture seeks to minimize the hardware requirements associated with complex cache and complex register-file schemes. Normally, a processor chip includes a cache hierarchy that allows the cores to work directly with registers. However, cache operations require considerable die area and consume additional power. The disclosed memory architecture avoids the use of a cache hierarchy by adding logic components in the memory.

The disclosed architecture also enables strategic (or even optimized) placement of data in the memory banks. Even if the memory banks have a single port and high latency, the disclosed memory architecture may achieve high performance and avoid memory access bottlenecks by strategically placing data in different blocks of the memory banks. With the goal of providing a continuous stream of data to the processing units, a compilation optimization step may determine how data should be stored in the memory banks for a specific or generic task. The memory controller that interfaces the processing units and the memory banks may then be configured to grant access to a particular processing unit when that processing unit requires the data to perform an operation.

The configuration of the memory chip may be performed by a processing unit (e.g., a configuration manager) or by an external interface. The configuration may also be written by a compiler or another SW tool. Moreover, the configuration of the memory controller may be based on the available ports of the memory banks and on the organization of the data in the memory banks. Accordingly, the disclosed architecture may provide the processing units with a constant flow of data, or with simultaneous information, from different memory blocks. In this way, computation tasks within the memory can be processed quickly by avoiding latency bottlenecks or cache requirements.

Furthermore, the data stored in the memory chip may be arranged based on a compilation optimization step. The compilation may allow building a process in which the processor efficiently assigns tasks to the processing units without the delays associated with memory latency. The compilation may be performed by a compiler and transmitted to a host connected to an external interface in the substrate. Normally, high latency and/or a small number of ports for certain access patterns would create data bottlenecks for the processing units requiring the data. The disclosed compilation, however, may place the data in the memory banks in a way that enables the processing units to receive data continuously, even with unfavorable memory types.

Moreover, in some embodiments, the configuration manager may signal the required processing units based on the computations required by a task. Different processing units or logic blocks in the chip may have specialized hardware or architectures for different tasks. Therefore, depending on the task to be performed, a processing unit, or a group of processing units, may be selected to perform the task. The memory controller on the substrate may be configurable to route data, or grant access, according to the selection of processing subunits, to improve data transfer speed. For example, based on the compilation optimization and the memory architecture, a processing unit may be granted access to a memory bank when that processing unit is required to perform a task.

Furthermore, the chip architecture may include on-chip components that facilitate the transfer of data by reducing the time required to access data in the memory banks. Accordingly, the present disclosure describes a chip architecture, together with a compilation optimization step, for a high-performance processor capable of performing specific or generic tasks using simple memory instances. The memory instances may have high random access latency and/or a small number of ports, such as those used in DRAM devices or other memory-oriented technologies, but the disclosed architecture may overcome these drawbacks by enabling a continuous (or nearly continuous) flow of data from the memory banks to the processing units.

In this application, simultaneous communication may refer to communication within one clock cycle. Alternatively, simultaneous communication may refer to sending information within a predetermined amount of time. For example, simultaneous communication may refer to communication within a few nanoseconds.

FIG. 22 provides functional block diagrams of exemplary processing devices consistent with the disclosed embodiments. FIG. 22A shows a first embodiment of a processing device 2200 in which a memory controller 2210 connects a first memory block 2202 and a second memory block 2204 using a multiplexer. Memory controller 2210 may also be connected to at least one configuration manager 2212, a logic block 2214, and a plurality of accelerators 2216(a) to 2216(n). FIG. 22B shows a second embodiment of processing device 2200 in which memory controller 2210 connects memory blocks 2202 and 2204 using a bus that connects memory controller 2210 with at least one configuration manager 2212, a logic block 2214, and the plurality of accelerators 2216(a) to 2216(n). Furthermore, a host 2230 may be external to processing device 2200 and connected to it via, for example, an external interface.

Memory blocks 2202 and 2204 may include DRAM mats or groups of mats, DRAM banks, MRAM/PRAM/ReRAM/SRAM units, flash memory mats, or other memory technologies. Memory blocks 2202 and 2204 may alternatively include non-volatile memory, flash memory devices, resistive random access memory (ReRAM) devices, or magnetoresistive random access memory (MRAM) devices.

Memory blocks 2202 and 2204 may additionally include a plurality of memory cells arranged in rows and columns between a plurality of word lines (not shown) and a plurality of bit lines (not shown). The gates of each row of memory cells may be connected to a respective one of the plurality of word lines. Each column of memory cells may be connected to a respective one of the plurality of bit lines.

In other embodiments, the memory area (including memory blocks 2202 and 2204) is built from simple memory instances. In this application, the term "memory instance" may be used interchangeably with the term "memory block." Memory instances (or blocks) may have poor characteristics. For example, the memory may be single-port only and may have high random access latency. Alternatively or additionally, the memory may be inaccessible during row and line changes and may face data access problems related to, for example, capacitor charging and/or circuitry setup. Nevertheless, by allowing dedicated connections between memory instances and processing units, and by arranging the data in a manner that accounts for the characteristics of the blocks, the architecture presented in FIG. 22 still facilitates parallel processing in the memory device.

In some device architectures, memory instances may include several ports to facilitate parallel operation. In these embodiments, however, the chip may still achieve improved performance when the data is compiled and organized based on the chip architecture. For example, the compiler may improve the efficiency of accesses in the memory area by providing instructions and organizing data placement, so that the memory area can be accessed easily even when single-port memories are used.

Furthermore, memory blocks 2202 and 2204 may be of multiple memory types within a single chip. For example, memory blocks 2202 and 2204 may be eFlash and eDRAM. In addition, the memory blocks may include DRAM together with ROM instances.

Memory controller 2210 may include logic circuits to handle memory accesses and return results to the rest of the module. For example, memory controller 2210 may include an address manager and selection devices, such as multiplexers, to route data between the memory blocks and the processing units or to grant access to the memory blocks. Alternatively, memory controller 2210 may include a double data rate (DDR) memory controller used to drive DDR SDRAM, in which data is transferred on both the rising and falling edges of the system's memory clock.

In addition, memory controller 2210 may constitute a dual-channel memory controller. The incorporation of dual-channel memory may facilitate the control of parallel access lines by memory controller 2210. The parallel access lines may be configured to have identical lengths, to facilitate data synchronization when multiple lines are used in conjunction. Alternatively or additionally, the parallel access lines may allow access to multiple memory ports of the memory banks.

In some embodiments, processing device 2200 may include one or more multiplexers that may be connected to the processing units. The processing units may include configuration manager 2212, logic block 2214, and accelerators 2216, which may be connected directly to the multiplexer. In addition, memory controller 2210 may include at least one data input from the plurality of memory banks or blocks 2202 and 2204, and at least one data output connected to each of the plurality of processing units. With this configuration, memory controller 2210 may simultaneously receive data from memory banks or memory blocks 2202 and 2204 via the two data inputs, and simultaneously transmit the received data to at least one selected processing unit via the two data outputs. In some embodiments, however, the at least one data input and the at least one data output may be implemented in a single port that allows only read or write operations. In such embodiments, the single port may be implemented as a data bus including data lines, address lines, and command lines.

Memory controller 2210 may be connected to each of the plurality of memory blocks 2202 and 2204, and may also be connected to the processing units via, for example, a selection switch. The processing units on the substrate, including configuration manager 2212, logic block 2214, and accelerators 2216, may also be connected to memory controller 2210 independently. In some embodiments, configuration manager 2212 may receive an indication of a task to be performed and, in response, configure memory controller 2210, accelerators 2216, and/or logic block 2214 according to a configuration stored in memory or supplied externally. Alternatively, memory controller 2210 may be configured by an external interface. The task may require at least one computation that may be used to select at least one selected processing unit from the plurality of processing units. Alternatively or additionally, the selection may be based at least in part on the capability of the selected processing unit to perform the at least one computation. In response, memory controller 2210 may grant access to the memory banks, or route data between the at least one selected processing unit and at least two memory banks, using dedicated buses and/or pipelined memory accesses.
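The selection-and-routing behavior described here can be sketched as follows. The task descriptor format, the capability sets, and the `grant_access` bookkeeping are illustrative assumptions, not the disclosed controller design.

```python
def dispatch_task(task, processing_units, memory_controller):
    """Selects a processing unit able to perform the task's computation
    and asks the memory controller to route the task's memory banks to
    that unit (a behavioral sketch, not the actual controller)."""
    selected = next(
        (u for u in processing_units
         if task["computation"] in u["capabilities"]),
        None,
    )
    if selected is None:
        raise RuntimeError("no processing unit can perform the computation")
    for bank in task["banks"]:
        memory_controller.grant_access(unit=selected["name"], bank=bank)
    return selected["name"]


class MemoryControllerStub:
    def __init__(self):
        self.routes = []

    def grant_access(self, unit, bank):
        self.routes.append((unit, bank))


controller = MemoryControllerStub()
units = [{"name": "accel_0", "capabilities": {"vector_mac"}},
         {"name": "accel_1", "capabilities": {"string_compare"}}]
task = {"computation": "vector_mac", "banks": [2202, 2204]}
assert dispatch_task(task, units, controller) == "accel_0"
assert controller.routes == [("accel_0", 2202), ("accel_0", 2204)]
```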

In some embodiments, a first memory block 2202 of the at least two memory blocks may be arranged on a first side of the plurality of processing units, and a second memory bank 2204 of the at least two memory banks may be arranged on a second side of the plurality of processing units, opposite the first side. In addition, the selected processing unit used to perform the task (e.g., accelerator 2216(n)) may be configured to access the second memory bank 2204 during a clock cycle in which a communication line to the first memory bank or first memory block 2202 is opened. Alternatively, the selected processing unit may be configured to transfer data to the second memory block 2204 during a clock cycle in which a communication line to the first memory block 2202 is opened.

In some embodiments, memory controller 2210 may be implemented as a standalone element, as shown in FIG. 22. In other embodiments, however, memory controller 2210 may be embedded in the memory area or may be disposed along accelerators 2216(a) to 2216(n).

The processing areas in processing device 2200 may include configuration manager 2212, logic block 2214, and accelerators 2216(a) to 2216(n). Accelerators 2216 may include multiple processing circuits with predefined functions and may be defined by a specific application. For example, an accelerator may be a vector multiply-accumulate (MAC) unit or a direct memory access (DMA) unit handling memory movement between modules. Accelerators 2216 may also be able to calculate their own addresses and request data from, or write data to, memory controller 2210. For example, configuration manager 2212 may signal to at least one of accelerators 2216 that it may access a memory bank. The accelerator 2216 may then configure memory controller 2210 to route data, or grant access, to the accelerator itself. In addition, accelerators 2216 may include at least one arithmetic logic unit, at least one vector-handling logic unit, at least one string-comparison logic unit, at least one register, and at least one direct memory access unit.
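As a concrete, though purely software, illustration of one such accelerator, the sketch below models a vector multiply-accumulate unit that computes its own operand addresses and requests each word through the memory controller. The read interface and the address layout are assumptions for illustration only.

```python
def vector_mac(read_word, base_a, base_b, length, acc=0):
    """Behavioral model of a vector MAC accelerator: it generates its
    own operand addresses and accumulates a dot product.
    `read_word(addr)` stands in for a request to the memory controller."""
    for i in range(length):
        a = read_word(base_a + i)  # accelerator-computed address, vector A
        b = read_word(base_b + i)  # accelerator-computed address, vector B
        acc += a * b
    return acc


# Toy memory with two vectors laid out at 0x000 and 0x100.
memory = {0x000 + i: v for i, v in enumerate([1, 2, 3])}
memory.update({0x100 + i: v for i, v in enumerate([4, 5, 6])})
assert vector_mac(memory.__getitem__, 0x000, 0x100, 3) == 32
```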

Configuration manager 2212 may include digital processing circuits to configure accelerators 2216 and to instruct the execution of tasks. For example, configuration manager 2212 may be connected to memory controller 2210 and to each of the plurality of accelerators 2216. Configuration manager 2212 may have its own dedicated memory to hold the configurations of accelerators 2216. Configuration manager 2212 may use the memory banks to fetch commands and configurations via memory controller 2210. Alternatively, configuration manager 2212 may be programmed via an external interface. In certain embodiments, configuration manager 2212 may be implemented with an on-chip reduced instruction set computer (RISC) or an on-chip complex CPU with its own cache hierarchy. In some embodiments, configuration manager 2212 may also be omitted, and the accelerators may be configured via an external interface.

Processing device 2200 may also include an external interface (not shown). The external interface allows the memory to be accessed from an upper level (a memory bank controller that receives commands from external host 2230 or from an on-chip main processor), or directly from external host 2230 or the on-chip main processor. The external interface may allow programming of configuration manager 2212 and accelerators 2216 by writing configurations or code to the memory via memory controller 2210, for later use by configuration manager 2212 or by units 2214 and 2216 themselves. However, the external interface may also program the processing units directly, without routing through memory controller 2210. In the case where configuration manager 2212 is a microcontroller, configuration manager 2212 may allow code to be loaded from a main memory to the controller area memory via the external interface. Memory controller 2210 may be configured to interrupt a task in response to receiving a request from the external interface.

The external interface may include multiple connectors, associated with logic circuits, that provide a glueless interface to various elements on the processing device. The external interface may include: data I/O inputs for data reads and outputs for data writes; an external address output; an external CE0 chip-select pin; an active-low chip selector; a byte enable pin; a pin for wait states during memory cycles; a write enable pin; an output-enable-active pin; and read/write enable pins. Therefore, the external interface has the required inputs and outputs to control processes and obtain information from the processing device. For example, the external interface may conform to a JEDEC DDR standard. Alternatively or additionally, the external interface may conform to other standards, such as SPI, OSPI, or UART.

In some embodiments, the external interface may be disposed on the chip substrate and may connect to external host 2230. The external host may access memory blocks 2202 and 2204, memory controller 2210, and the processing units via the external interface. Alternatively or additionally, external host 2230 may read and write the memory, or may signal configuration manager 2212 via read and write commands to perform operations such as starting a process and/or stopping a process. In addition, external host 2230 may configure accelerators 2216 directly. In some embodiments, external host 2230 may be able to perform read/write operations directly on memory blocks 2202 and 2204.

In some embodiments, configuration manager 2212 and accelerators 2216 may be configured to connect the device area and the memory area with direct buses, depending on the target task. For example, when a subset of accelerators 2216 is capable of performing the computations required for the execution of a task, that subset of accelerators may be connected with memory instance 2204. With this separation, it is possible to ensure that the dedicated accelerators obtain the bandwidth (BW) they require from memory blocks 2202 and 2204. Moreover, this configuration with dedicated buses may allow a large memory to be split into smaller instances or blocks, because connecting memory instances to memory controller 2210 allows fast access to data in different memories even with high row latency times. To achieve parallelization of the connections, memory controller 2210 may be connected to each of the memory instances with a data bus, an address bus, and/or a control bus.

The inclusion of memory controller 2210 described above may eliminate the requirement for a cache hierarchy or a complex register file in the processing device. Although a cache hierarchy can be added for additional capability, the architecture of processing device 2200 may allow a designer to add enough memory blocks or instances based on the processing operations and to manage those instances accordingly without a cache hierarchy. For example, the architecture of processing device 2200 may eliminate the need for a cache hierarchy by implementing pipelined memory access. In pipelined memory access, the processing units may receive a continuous flow of data in every cycle, where certain data lines are being opened (or activated) while other data lines are receiving or transmitting data. A continuous data flow over separate communication lines may achieve improved execution speed and minimal latency due to line changes.

Furthermore, the disclosed architecture of FIG. 22 enables pipelined memory access, making it possible to organize data in a small number of memory blocks and to save the power loss caused by line switching. For example, in some embodiments, a compiler may communicate to host 2230 the organization of data in the memory banks, or a method for organizing data in the memory banks, to facilitate access to the data during a given task. Configuration manager 2212 may then define which memory banks, and in some cases which ports of the memory banks, may be accessed by the accelerators. This synchronization between the location of data in the memory banks and the method of accessing the data improves computing tasks by feeding data to the accelerators with minimal latency. For example, in embodiments where configuration manager 2212 includes a RISC/CPU, the method may be implemented in offline software (SW), and configuration manager 2212 may then be programmed to execute the method. The method may be developed in any language executable by a RISC/CPU computer and may be executed on any platform. The inputs to the method may include the configuration of the memories behind the memory controller and the data itself, along with the pattern of memory accesses. Furthermore, the method may be implemented in an embodiment-specific language or in machine language, and may also simply be a series of configuration values in binary or text form.

As discussed above, in some embodiments, a compiler may provide instructions to host 2230 for organizing data in memory blocks 2202 and 2204 in preparation for pipelined memory access. Pipelined memory access may generally include the following steps: receiving a plurality of addresses for a plurality of memory banks or memory blocks 2202 and 2204; accessing the plurality of memory banks via independent data lines according to the received addresses; supplying, over a first communication line, data from a first address in a first memory bank of the plurality of memory banks to at least one of a plurality of processing units, while opening a second communication line to a second address in a second memory bank 2204 of the plurality of memory banks; and, within a second clock cycle, supplying data from the second address over the second communication line to the at least one of the plurality of processing units, while opening a third communication line to a third address in the first memory bank. In some embodiments, pipelined memory access may be performed with two memory blocks connected to a single port. In these embodiments, memory controller 2210 may hide the two memory blocks behind a single port but transfer the data to the processing units using the pipelined memory access method.
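
The overlap between supplying data over one communication line and opening the next line can be illustrated with a minimal software sketch in Python, shown below. The Bank class, the open/read helpers, and the alternating address list are assumptions made for illustration only; the sketch models the scheduling idea rather than the circuitry of memory controller 2210.

# Sketch: pipelined access across two memory banks.
# Assumptions (not from the specification): each bank must open a line before
# a word on that line can be supplied, and words for consecutive addresses
# have been placed in alternating banks by the compiler.

class Bank:
    def __init__(self, lines):
        self.lines = lines          # {line_number: [words]}
        self.open_line = None       # currently open line, if any

    def open(self, line):
        self.open_line = line       # models the "open line" step

    def read(self, offset):
        return self.lines[self.open_line][offset]

def pipelined_read(banks, addresses):
    """addresses: list of (bank_index, line, offset), alternating between banks."""
    out = []
    first_bank, first_line, _ = addresses[0]
    banks[first_bank].open(first_line)          # prime the pipeline
    for i, (b, line, off) in enumerate(addresses):
        # Overlap: while supplying data from the current bank, open the line
        # that the *next* address needs in the other bank.
        if i + 1 < len(addresses):
            nb, nline, _ = addresses[i + 1]
            banks[nb].open(nline)
        out.append(banks[b].read(off))          # supply data on this "cycle"
    return out

banks = [Bank({0: [10, 11], 1: [12, 13]}), Bank({0: [20, 21], 1: [22, 23]})]
print(pipelined_read(banks, [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]))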

In some embodiments, the compiler may execute on host 2230 before the task is executed. In these embodiments, the compiler may be able to determine the configuration of the data flow based on the architecture of the memory device, because that configuration will be known to the compiler.

In other embodiments, if the configuration of memory blocks 2204 and 2202 is unknown at offline time, the pipelined method may be executed on host 2230, which may arrange the data in the memory blocks before starting the computation. For example, host 2230 may write data directly into memory blocks 2204 and 2202. In these embodiments, processing units such as configuration manager 2212 and memory controller 2210 may not have information about the required hardware until runtime. It may then be necessary to delay the selection of an accelerator 2216 until the task starts running. In these situations, the processing unit or memory controller 2210 may randomly select an accelerator 2216 and create a test data access pattern, which may be modified as the task is executed.

However, when the task is known in advance, the compiler may organize data and instructions in the memory banks for host 2230 to provide to a processing unit such as configuration manager 2212, in order to set up the signal connections that minimize access latency. For example, in some cases accelerators 2216 may require n words at the same time, but each memory instance supports retrieving only m words at a time, where "m" and "n" are integers and m < n. The compiler may therefore place the required data across different memory instances or blocks to facilitate data access. In addition, to avoid row-miss latency, where processing device 2200 includes multiple memories, the host may split the data across different lines of different memory instances. This division of the data may allow the next data line in the next instance to be accessed while the data from the current instance is still being used.

For example, accelerator 2216(a) may be configured to multiply two vectors. Each of the vectors may be stored in a separate memory block, such as memory blocks 2202 and 2204, and each vector may include multiple words. Therefore, to complete a task that requires accelerator 2216(a) to perform the multiplication, it may be necessary to access two memory blocks and retrieve multiple words. In some embodiments, however, a memory block allows access to only one word per clock cycle; for example, the memory block may have a single port. In these situations, to speed up the data transfer during the operation, the compiler may organize the words that make up the vectors in different memory blocks, allowing the words to be read in parallel and/or simultaneously. In these cases, the compiler may store the words in memory blocks with dedicated lines. For example, if each vector includes two words and the memory controller can directly access four memory blocks, the compiler may arrange the data across the four memory blocks, each transferring one word, thereby speeding up data delivery. Furthermore, in embodiments where memory controller 2210 may have more than a single connection to each memory block, the compiler may instruct configuration manager 2212 (or another processing unit) to access a specific port. In this way, processing device 2200 may perform pipelined memory access to provide data continuously to the processing units by loading words on some lines while transferring data on other lines. Such pipelined memory access can therefore avoid latency problems.
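
A minimal sketch of this placement strategy follows, assuming a hypothetical helper (place_vectors_across_banks) and a four-bank layout; it only illustrates how spreading one word per bank lets all operands be fetched in a single cycle.

# Sketch: spread the words of two 2-word vectors across four single-port
# banks so that all four words can be fetched in parallel, one per bank.

def place_vectors_across_banks(vectors, num_banks):
    """vectors: list of lists of words. Returns bank -> list of (vector, index)."""
    placement = {b: [] for b in range(num_banks)}
    bank = 0
    for v, vector in enumerate(vectors):
        for i, _word in enumerate(vector):
            placement[bank % num_banks].append((v, i))
            bank += 1                     # the next word goes to the next bank
    return placement

vec_a = [3, 5]        # first operand vector
vec_b = [7, 2]        # second operand vector
layout = place_vectors_across_banks([vec_a, vec_b], num_banks=4)
print(layout)         # each bank holds exactly one word -> one-cycle fetch

# With one word per bank, a single "cycle" can deliver all operands and the
# accelerator can compute the element-wise products immediately:
words = {(v, i): [vec_a, vec_b][v][i] for bank in layout.values() for (v, i) in bank}
products = [words[(0, i)] * words[(1, i)] for i in range(len(vec_a))]
print(products)       # [21, 10]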

FIG. 23 is a functional block diagram of an exemplary processing device 2300 consistent with the disclosed embodiments. The functional block diagram shows a simplified processing device 2300 with a single accelerator in the form of a MAC unit 2302, a configuration manager 2304 (equivalent or similar to configuration manager 2212), a memory controller 2306 (equivalent or similar to memory controller 2210), and a plurality of memory blocks 2308(a) to 2308(d).

In some embodiments, MAC unit 2302 may be a specific accelerator for handling a particular task. As an example, processing device 2300 may be tasked with a 2D convolution. Configuration manager 2304 may then signal an accelerator that has the appropriate hardware to perform the computations associated with the task. For example, MAC unit 2302 may have four internal increment counters (logical adders and registers to manage the four loops required by the convolution computation) and a multiply-accumulate unit. Configuration manager 2304 may signal MAC unit 2302 to process the incoming data and perform the task, and may transmit an indication to MAC unit 2302 to perform the task. In these situations, MAC unit 2302 may iterate over the calculated addresses, multiply the numbers, and accumulate them into an internal register.
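
The four loops managed by those counters can be pictured with the following generic software model of a direct 2D convolution (unit stride, no padding) with a multiply-accumulate in the innermost loop; this is an illustrative sketch and is not taken from the disclosed hardware.

# Sketch: the four nested loops of a direct 2D convolution, with the
# multiply-accumulate performed in the innermost iteration.

def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):                 # loop 1: output row
        for c in range(ow):             # loop 2: output column
            acc = 0                     # internal accumulation register
            for i in range(kh):         # loop 3: kernel row
                for j in range(kw):     # loop 4: kernel column
                    acc += image[r + i][c + j] * kernel[i][j]   # MAC step
            out[r][c] = acc
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d(image, kernel))    # [[-4, -4], [-4, -4]]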

In some embodiments, configuration manager 2304 may configure the accelerator, while memory controller 2306 grants access to blocks 2308 and MAC unit 2302 using dedicated buses. In other embodiments, however, memory controller 2306 may configure the accelerator directly based on instructions received from configuration manager 2304 or from the external interface. Alternatively or additionally, configuration manager 2304 may preload several configurations and allow the accelerator to run iteratively over different addresses with different sizes. In these embodiments, configuration manager 2304 may include a cache that stores commands before they are transmitted to at least one of a plurality of processing units, such as accelerators 2216. In other embodiments, however, configuration manager 2304 may not include a cache.

In some embodiments, configuration manager 2304 or memory controller 2306 may receive the addresses that need to be accessed for a task. Configuration manager 2304 or memory controller 2306 may check a register to determine whether an address is already in a loaded line to one of memory blocks 2308. If it is, memory controller 2306 may read the word from memory block 2308 and pass it to MAC unit 2302. If the address is not in a loaded line, configuration manager 2304 may request that memory controller 2306 load the line and may signal MAC unit 2302 to delay until the loaded line is retrieved.
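
The hit/miss decision described in this paragraph can be summarized by the sketch below; the open_lines register model and the fetch_word, load_line, and stall_accelerator callbacks are hypothetical names used only to make the branches concrete.

# Sketch: decide whether a requested address hits an already-open line.
# "open_lines" models the register that records which line is open per bank.

open_lines = {}          # bank -> currently open line number (if any)

def request_word(bank, line, fetch_word, load_line, stall_accelerator):
    if open_lines.get(bank) == line:
        return fetch_word(bank, line)        # hit: read and forward the word
    stall_accelerator()                      # miss: tell the MAC unit to wait
    load_line(bank, line)                    # ask the controller to open the line
    open_lines[bank] = line                  # record the newly opened line
    return fetch_word(bank, line)            # forward once the line is loaded

# Example wiring with trivial stand-ins for the hardware behaviour:
memory = {(0, 2): "word_at_bank0_line2"}
print(request_word(0, 2,
                   fetch_word=lambda b, l: memory[(b, l)],
                   load_line=lambda b, l: None,
                   stall_accelerator=lambda: print("stall")))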

In some embodiments, as shown in FIG. 23, memory controller 2306 may include two inputs forming two independent addresses. But if more than two addresses must be accessed simultaneously and those addresses are in a single memory block (for example, the addresses are all in memory block 2308(a)), then memory controller 2306 or configuration manager 2304 may raise an exception. Alternatively, when two addresses can only be accessed via a single line, configuration manager 2304 may return an invalid-data signal. In other embodiments, the unit may delay execution of the process until it is possible to retrieve all the required data. This may reduce overall performance. However, the compiler may be able to find a configuration and data placement that prevents the delay.

In some embodiments, the compiler may generate a configuration or instruction set for processing device 2300 that configures configuration manager 2304, memory controller 2306, and accelerator 2302 to handle situations in which multiple addresses must be accessed from a single memory block that has only one port. For example, the compiler may rearrange the data in memory blocks 2308 so that the processing units can access multiple lines in memory blocks 2308.

In addition, memory controller 2306 may also work on more than one input at the same time. For example, memory controller 2306 may allow one of memory blocks 2308 to be accessed via one port and supply data when a request for a different memory block is received on the other input. As a result, this operation may allow accelerator 2216, tasked with the exemplary 2D convolution, to receive data from the dedicated communication lines of the relevant memory blocks.

Additionally or alternatively, memory controller 2306 or a logic block may keep a refresh counter for each memory block 2308 and handle the refresh of all lines. Having such a counter allows memory controller 2306 to insert refresh cycles between the device's idle access times.

Furthermore, memory controller 2306 may be configurable to perform pipelined memory access, receiving an address and opening a line in a memory block before supplying the data. Such pipelined memory access may provide data to the processing units without interrupting or delaying clock cycles. For example, while memory controller 2306 or one of the logic blocks is accessing data using the right-hand line in FIG. 23, the memory controller or logic block may be transferring data on the left-hand line. These methods are explained in more detail with respect to FIG. 26.

In response to the required data, processing device 2300 may use multiplexers and/or other switching devices to choose which devices are served to perform a given task. For example, configuration manager 2304 may configure the multiplexers so that at least two data lines reach MAC unit 2302. In this way, tasks that require data from multiple addresses, such as 2D convolution, can be performed faster, because the vectors or words that need to be multiplied during the convolution can reach the processing unit simultaneously in a single clock. This data transfer method may allow processing units such as accelerators 2216 to output results quickly.

In some embodiments, configuration manager 2304 may be configurable to execute processes based on task priority. For example, configuration manager 2304 may be configured to let a running process complete without any interruption. In that case, configuration manager 2304 may provide the task's instructions or configuration to accelerator 2216, let the accelerator run uninterrupted, and switch the multiplexers only when the task is complete. In other embodiments, however, configuration manager 2304 may interrupt a task and reconfigure the data routing when it receives a priority task, such as a request from the external interface. Nevertheless, where memory blocks 2308 are sufficient, memory controller 2306 may be configurable to route data to, or grant access to, a processing unit using dedicated lines that do not need to change until the task is complete. Moreover, in some embodiments, all of the devices may be connected by a bus to the entity of configuration manager 2304, and the devices may manage access between themselves and the bus (for example, using the same logic as a multiplexer). Memory controller 2306 may therefore be directly connected to a number of memory instances or memory blocks.

Alternatively, memory controller 2306 may connect directly to memory sub-instances. In some embodiments, each memory instance or block may be built from sub-instances (for example, a DRAM may be built from mats with independent data lines arranged in multiple sub-blocks). In addition, the instances may include at least one of a DRAM mat, a DRAM bank, a flash mat, or an SRAM mat, or any other type of memory. Memory controller 2306 may then include dedicated lines to address the sub-instances directly, in order to minimize latency during pipelined memory access.

In some embodiments, memory controller 2306 may also hold the logic required by a specific memory instance (such as row/column decoders, refresh logic, and so on), while memory block 2308 may handle its own logic. Memory block 2308 may then receive addresses and generate the commands to return or write the data.

FIG. 24 depicts an exemplary memory configuration diagram consistent with the disclosed embodiments. In some embodiments, the compiler that generates the code or configuration for processing device 2200 may perform a method to configure loading from memory blocks 2202 and 2204 by pre-arranging the data in each block. For example, the compiler may pre-arrange the data so that each word required by a task is associated with a line of one memory instance or memory block. For tasks that require more memory blocks than are available in processing device 2200, however, the compiler may implement a method that fits the data into more than one memory location of each memory block. The compiler may also store the data sequentially and evaluate the latency of each memory block to avoid row-miss latency. In some embodiments, the host may be part of a processing unit, such as configuration manager 2212, but in other embodiments the compiler host may connect to processing device 2200 via the external interface. In these embodiments, the host may run compilation functions such as those described for the compiler.

In some embodiments, configuration manager 2212 may be a CPU or a microcontroller (uC). In these embodiments, configuration manager 2212 may have to access the memory to fetch commands or instructions placed in the memory. A specific compiler may generate the code and place it in memory in a way that allows consecutive commands to be stored in the same memory line and across several memory banks, so that the fetched commands can also be served by pipelined memory access. In these embodiments, configuration manager 2212 and memory controller 2210 may be able to avoid row latency in linear execution by enabling pipelined memory access.
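
One way to picture such a code layout, under an assumed line size and bank count, is the following sketch; layout_commands is a hypothetical helper, not part of the disclosed compiler.

# Sketch: lay out a linear command stream so that consecutive commands fall
# in the same line and successive lines rotate across banks, letting the
# next line be opened while the current one is being fetched.

def layout_commands(commands, num_banks, words_per_line):
    """Returns (bank, line, offset) for each command index."""
    layout = []
    for i, _cmd in enumerate(commands):
        line_index = i // words_per_line        # which line the command sits in
        bank = line_index % num_banks           # successive lines rotate over banks
        line = line_index // num_banks          # line number inside that bank
        layout.append((bank, line, i % words_per_line))
    return layout

cmds = [f"op{i}" for i in range(8)]
for cmd, where in zip(cmds, layout_commands(cmds, num_banks=2, words_per_line=2)):
    print(cmd, "->", where)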

The preceding case of linear program execution describes a method by which the compiler can recognize and place instructions to allow pipelined memory execution. Other software structures may be more complex, however, and will require the compiler to recognize those structures and act accordingly. For example, where a task requires loops and branches, the compiler may place all of the loop code within a single line, so that the single line can be looped over without line-open latency. Memory controller 2210 may then not need to change lines during execution.

In some embodiments, configuration manager 2212 may include an internal cache or a small memory. The internal cache may store commands that configuration manager 2212 executes to handle branches and loops. For example, the commands in the internal cache may include instructions to configure accelerators for accessing the memory blocks.

FIG. 25 is an exemplary flowchart illustrating a possible memory configuration process 2500 consistent with the disclosed embodiments. For ease of describing memory configuration process 2500, reference may be made to the identifiers of the elements depicted in FIG. 22 and described above. In some embodiments, process 2500 may be executed by a compiler that provides instructions to a host connected via the external interface. In other embodiments, process 2500 may be executed by a component of processing device 2200, such as configuration manager 2212.

In general, process 2500 may include: determining the number of words required simultaneously to perform a task; determining the number of words that can be accessed simultaneously from each of a plurality of memory banks; and, when the number of simultaneously required words is greater than the number of words that can be accessed simultaneously, dividing the simultaneously required words among the plurality of memory banks. Furthermore, dividing that number of words may include performing a round-robin organization of the words, assigning one word to each memory bank in sequence.
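
A minimal sketch of this division step, assuming a hypothetical helper name (divide_words), is shown below.

# Sketch: round-robin assignment of simultaneously needed words to banks,
# used only when the words needed at once exceed what one bank can supply.

def divide_words(words, words_per_bank_per_cycle, num_banks):
    if len(words) <= words_per_bank_per_cycle:
        return {0: list(words)}                  # a single bank is enough
    assignment = {b: [] for b in range(num_banks)}
    for i, word in enumerate(words):
        assignment[i % num_banks].append(word)   # one word per bank, in turn
    return assignment

print(divide_words(["a", "b", "c", "d"], words_per_bank_per_cycle=1, num_banks=4))
# {0: ['a'], 1: ['b'], 2: ['c'], 3: ['d']}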

More specifically, process 2500 may begin with step 2502, in which the compiler may receive a task specification. The specification includes the required computations and/or a priority level.

In step 2504, the compiler may identify an accelerator, or a group of accelerators, that can execute the task. Alternatively, the compiler may generate instructions so that a processing unit (such as configuration manager 2212) can identify an accelerator to perform the task. For example, using the required computations, configuration manager 2212 may identify accelerators in the group of accelerators 2216 that can handle the task.

In step 2506, the compiler may determine the number of words that must be accessed simultaneously in order to perform the task. For example, multiplying two vectors requires access to at least two vectors, and the compiler may therefore determine that the vector words must be accessed simultaneously to perform the operation.

In step 2508, the compiler may determine the number of cycles necessary to perform the task. For example, if the task requires four convolution operations that each produce a result, the compiler may determine that at least four cycles will be necessary to perform the task.

In step 2510, the compiler may place the words that require simultaneous access in different memory banks. In this way, memory controller 2210 can be configured to open lines to the different memory instances and access the required memory blocks within one clock cycle, without needing any cached data.

In step 2512, the compiler places words that are accessed sequentially in the same memory bank. For example, where four cycles of operations are required, the compiler may generate instructions to write the words required by the sequential cycles into a single memory block, to avoid changing lines between different memory blocks during execution.

In step 2514, the compiler generates instructions for programming a processing unit such as configuration manager 2212. The instructions may specify the conditions for operating switching devices (such as multiplexers) or for configuring the data buses. With these instructions, configuration manager 2212 may, according to the task, configure memory controller 2210 to route data from a memory block to a processing unit, or to grant access to that memory block, using dedicated communication lines.

FIG. 26 is an exemplary flowchart illustrating a memory read process 2600 consistent with the disclosed embodiments. For ease of describing memory read process 2600, reference may be made to the identifiers of the elements depicted in FIG. 22 and described above. In some embodiments, process 2600 may be implemented by memory controller 2210, as described below. In other embodiments, however, process 2600 may be implemented by other elements in processing device 2200, such as configuration manager 2212.

In step 2602, memory controller 2210, configuration manager 2212, or another processing unit may receive an indication to route data from, or grant access to, a memory bank. The request may specify an address and a memory block.

In some embodiments, the request may be received via a data bus, with a read command specified on line 2218 and an address specified on line 2220. In other embodiments, the request may be received via a demultiplexer connected to memory controller 2210.

In step 2604, configuration manager 2212, the host, or another processing unit may query an internal register. The internal register may include information about open lines to the memory banks, open addresses, open memory blocks, and/or upcoming tasks. Based on the information in the internal register, it may be determined whether there is an open line to the memory bank and/or memory block for which the request was received in step 2602. Alternatively or additionally, memory controller 2210 may query the internal register directly.

If the internal register indicates that the memory bank is not loaded in an open line (step 2606: no), process 2600 may continue to step 2616, and a line may be loaded to the memory bank associated with the received address. In addition, memory controller 2210 or a processing unit such as configuration manager 2212 may, in step 2618, signal a delay to the element requesting the information from the memory address. For example, if accelerator 2216 is requesting memory information located in a memory block that is already occupied, memory controller 2210 may, in step 2618, send a delay signal to the accelerator. In step 2620, configuration manager 2212 or memory controller 2210 may update the internal register to indicate that a line has been opened to the new memory bank or new memory block.

If the internal register indicates that the memory bank is loaded in an open line (step 2606: yes), process 2600 may continue to step 2608. In step 2608, it may be determined whether the line loaded for the memory bank is being used for a different address. If the line is being used for a different address (step 2608: yes), this indicates that two instances exist in a single block and therefore cannot be accessed simultaneously. An error or exception signal may then be sent, in step 2616, to the element requesting information from the memory address. But if the line is not being used for a different address (step 2608: no), the line for that address may be opened and the data retrieved from the target memory bank, and the process continues to step 2614 to transmit the data to the element requesting information from the memory address.
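
The branch structure of process 2600 can be sketched as follows; the state dictionary, the callback names, and the returned values are assumptions used only to make steps 2606, 2608, and 2616 concrete.

# Sketch: the branch structure of the read flow.
# state maps bank -> (open_line, address currently served on that line).

def handle_read(state, bank, line, address, load_line, read_word, send_delay):
    entry = state.get(bank)
    if entry is None:                          # step 2606: no open line in this bank
        send_delay()                           # tell the requester to wait
        load_line(bank, line)                  # step 2616: load the requested line
        state[bank] = (line, address)          # step 2620: update the internal register
        return read_word(bank, line, address)
    open_line, open_address = entry
    if open_address != address:                # step 2608: line in use for another address
        return "error"                         # cannot serve both at the same time
    return read_word(bank, open_line, address) # step 2614: return the requested data

state = {}
data = {(0, 1, 0x10): 42}
print(handle_read(state, 0, 1, 0x10,
                  load_line=lambda b, l: None,
                  read_word=lambda b, l, a: data[(b, l, a)],
                  send_delay=lambda: print("delay")))   # prints "delay", then 42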

Using process 2600, processing device 2200 can establish direct connections between processing units and the memory blocks or memory instances that contain the information needed to perform a task. This organization of the data makes it possible to read information from organized vectors in different memory instances, and allows information to be retrieved from different memory blocks simultaneously when a device requests several of these addresses.

FIG. 27 is an exemplary flowchart illustrating an execution process 2700 consistent with the disclosed embodiments. For ease of describing execution process 2700, reference may be made to the identifiers of the elements depicted in FIG. 22 and described above.

In step 2702, the compiler or a local unit such as configuration manager 2212 may receive an indication of a task that needs to be executed. The task may include a single computation (for example, a multiplication) or a more complex computation (for example, a convolution between matrices). The task may also indicate the required computations.

In step 2704, the compiler or configuration manager 2212 may determine the number of words required simultaneously to perform the task. For example, the compiler may determine that two words are needed at the same time to perform a multiplication between vectors. In another example, a 2D convolution task, configuration manager 2212 may determine that a convolution between matrices requires "n" by "m" words, where "n" and "m" are the matrix dimensions. In addition, in step 2704, configuration manager 2212 may also determine the number of cycles necessary to perform the task.

In step 2706, depending on the determinations of step 2704, the compiler may write the words that require simultaneous access into a plurality of memory banks disposed on the substrate. For example, when the number of words that can be accessed simultaneously from one of the memory banks is smaller than the number of words required simultaneously, the compiler may organize the data across multiple memory banks to facilitate access to the different required words within one clock cycle. In addition, when configuration manager 2212 or the compiler determines the number of cycles necessary to perform the task, the compiler may write the words required by sequential cycles into a single one of the memory banks, to prevent switching lines between memory banks.

In step 2708, memory controller 2210 may be configured to read, or to grant access to, at least one first word from a first memory bank of the plurality of memory banks or blocks, using a first memory line.

In step 2710, a processing unit (for example, one of accelerators 2216) may process the task using the at least one first word.

In step 2712, memory controller 2210 may be configured to open a second memory line in a second memory bank. For example, based on the task and using the pipelined memory access method, memory controller 2210 may be configured to open a second memory line in the second memory block into which the information required by the task was written in step 2706. In some embodiments, the second memory line may be opened as the task of step 2710 is about to complete. For example, if a task requires 100 clocks, the second memory line may be opened at the 90th clock.

In some embodiments, steps 2708 to 2712 may be executed within one line access cycle.

In step 2714, memory controller 2210 may be configured to grant access to data of at least one second word from the second memory bank, using the second memory line opened in step 2712.

In step 2716, a processing unit (for example, one of accelerators 2216) may process the task using the at least one second word.

In step 2718, memory controller 2210 may be configured to open a second memory line in the first memory bank. For example, based on the task and using the pipelined memory access method, memory controller 2210 may be configured to open a second memory line to the first memory block. In some embodiments, the second memory line to the first block may be opened as the task of step 2716 is about to complete.

In some embodiments, steps 2714 to 2718 may be executed within one line access cycle.

In step 2720, memory controller 2210 may read at least one third word, or grant access to the at least one third word, using the second memory line in the first memory bank or a first line in a third memory bank, continuing in this manner across different memory banks.
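
The alternating pattern of steps 2708 through 2720 can be illustrated with the sketch below, which assumes a simple two-bank model and hypothetical helper names; it shows only how opening the next line overlaps with processing the current word, not the actual controller logic.

# Sketch: alternating reads between two banks while the accelerator works,
# mirroring steps 2708-2720 (read from bank 0, open the next line in bank 1,
# read from bank 1, open the next line in bank 0, and so on).

def execute_task(words_by_bank, process):
    """words_by_bank: [words_in_bank0, words_in_bank1], written per step 2706."""
    # Access order: bank 0 line 0, bank 1 line 0, bank 0 line 1, bank 1 line 1, ...
    order = [(bank, line)
             for line in range(max(map(len, words_by_bank)))
             for bank in (0, 1)
             if line < len(words_by_bank[bank])]
    opened = {order[0]}                        # step 2708: the first line is already open
    results = []
    for idx, (bank, line) in enumerate(order):
        if idx + 1 < len(order):
            opened.add(order[idx + 1])         # steps 2712/2718: open the next line early
        assert (bank, line) in opened          # the data is ready when processing starts
        results.append(process(words_by_bank[bank][line]))   # steps 2710/2716
    return results

print(execute_task([[1, 3], [2, 4]], process=lambda w: w * 10))   # [10, 20, 30, 40]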

Some memory chips, such as dynamic random access memory (DRAM) chips, use refresh to prevent stored data (held, for example, using capacitance) from being lost due to voltage decay in the chip's capacitors or other electrical components. For example, in DRAM, each cell must be refreshed from time to time (based on the specific process and design) to restore the charge in its capacitor, so that data is not lost or corrupted. As the memory capacity of DRAM chips increases, the amount of time required to refresh the memory becomes significant. During the period in which a given line of memory is being refreshed, the bank containing that line cannot be accessed, which can reduce performance. In addition, the power associated with the refresh process can also be significant. Previous efforts have attempted to reduce the rate at which refresh is performed in order to reduce the adverse effects associated with refreshing memory, but most of these efforts have focused on the physical layer of the DRAM.

A refresh is similar to reading a row of memory and writing it back. Using this principle, and focusing on the patterns with which the memory is accessed, embodiments of the present disclosure include software and hardware techniques, and modifications to memory chips, to use less power for refresh and to reduce the amount of time spent refreshing the memory. For example, as an overview, some embodiments may use hardware and/or software to track line access timing and skip recently accessed rows within a refresh cycle (for example, based on a timing threshold). In another example, some embodiments may rely on software executed by the memory chip's refresh controller to assign reads and writes so that accesses to the memory are non-random. The software can therefore control refresh more precisely to avoid wasting refresh cycles and/or lines. These techniques may be used alone or in combination with a compiler that encodes the commands for the refresh controller and the machine code for the processor so that accesses to memory are likewise non-random. Using any combination of these techniques and configurations, described in detail below, the disclosed embodiments may reduce memory refresh power requirements and/or improve system performance by reducing the amount of time during which memory cells are being refreshed.

FIG. 28 depicts an example memory chip 2800 having a refresh controller 2803, consistent with the present disclosure. For example, memory chip 2800 may include a plurality of memory banks on a substrate (for example, memory bank 2801a and the like). In the example of FIG. 28, the substrate includes four memory banks, each having four lines. A line may refer to a wordline within one or more memory banks of memory chip 2800, or within any other collection of memory cells of memory chip 2800 (such as a portion of a memory bank, an entire row along a memory bank, or a group of memory banks).

In other embodiments, the substrate may include any number of memory banks, and each memory bank may include any number of lines. Some memory banks may include the same number of lines (as shown in FIG. 28), while other memory banks may include different numbers of lines. As further depicted in FIG. 28, memory chip 2800 may include a controller 2805 for receiving inputs to memory chip 2800 and transmitting outputs from memory chip 2800 (for example, as described above in the "Division of Code" section).

In some embodiments, the plurality of memory banks may comprise dynamic random access memory (DRAM). However, the plurality of memory banks may comprise any volatile memory that stores data requiring periodic refresh.

As will be discussed in more detail below, the disclosed embodiments may use counters or resistor-capacitor circuits to time the refresh cycles. For example, a counter or timer may be used to count the time since the last full refresh cycle, and then, when that counter reaches its target value, another counter may be used to iterate over all the rows. Embodiments of the present disclosure may additionally track accesses to sections of memory chip 2800 and reduce the refresh power required. For example, although not depicted in FIG. 28, memory chip 2800 may also include a data store configured to store access information indicative of access operations for one or more sections of the plurality of memory banks. The one or more sections may comprise any portion of a line, a column, or any other grouping of memory cells within memory chip 2800. In one particular example, the one or more sections may comprise memory structures of at least one row within the plurality of memory banks. Refresh controller 2803 may be configured to perform a refresh operation for the one or more sections based, at least in part, on the stored access information.

For example, the data store may include one or more registers, static random access memory (SRAM) cells, or the like, associated with sections of memory chip 2800 (for example, lines, columns, or any other groupings of memory cells within memory chip 2800). In addition, the data store may be configured to store a bit indicating whether the associated section was accessed during one or more previous cycles. A "bit" may include any data structure that stores at least one bit, such as a register, an SRAM cell, a non-volatile memory, or the like. Furthermore, a bit may be set by setting a corresponding switch (or switching element, such as a transistor) of the data structure to on (which may be equivalent to "1" or "true"). Additionally or alternatively, a bit may be set by modifying any other property of the data structure (such as charging a floating gate of a flash memory, modifying the state of one or more flip-flops in an SRAM, or the like) so as to write a "1" (or any other value indicating that the bit is set) to the data structure. If a bit is determined to be set as part of the memory controller's refresh operation, refresh controller 2803 may skip the refresh cycle for the associated section and clear the register associated with it.

In another example, the data store may include one or more non-volatile memories (for example, flash memory or the like) associated with sections of memory chip 2800 (for example, lines, columns, or any other groupings of memory cells within memory chip 2800). The non-volatile memory may be configured to store a bit indicating whether the associated section was accessed during one or more previous cycles.

Some embodiments may additionally or alternatively add, for each row or group of rows (or other section of memory chip 2800), a timestamp register that holds the last time the line was accessed within the current refresh cycle. This means that, on each row access, the refresh controller can update the row's timestamp register. When the next refresh occurs (for example, at the end of a refresh cycle), the refresh controller can then compare the stored timestamps and, if the associated section was accessed within a certain period of time (for example, within a certain threshold as applied to the stored timestamp), the refresh controller can skip to the next section. This prevents the system from spending refresh power on recently accessed sections. In addition, the refresh controller may continue to track accesses to ensure that every section is accessed or refreshed in the next cycle.

Accordingly, in yet another example, the data store may include one or more registers or non-volatile memories associated with sections of memory chip 2800 (for example, lines, columns, or any other groupings of memory cells within memory chip 2800). Rather than using a bit to indicate whether the associated section has been accessed, the register or non-volatile memory may be configured to store a timestamp or other information indicating the most recent access to the associated section. In this example, refresh controller 2803 may determine whether to refresh or access the associated section based on whether the amount of time between the timestamp stored in the associated register or memory and the current time (for example, from a timer, as explained below with respect to FIGS. 29A and 29B) exceeds a predetermined threshold (for example, 8 ms, 16 ms, 32 ms, 64 ms, or the like).

Accordingly, the predetermined threshold may comprise an amount of time within the refresh cycle that ensures the associated section is refreshed (if not accessed) at least once per refresh cycle. Alternatively, the predetermined threshold may comprise an amount of time shorter than the time required for a refresh cycle (for example, to ensure that any required refresh or access signal can reach the associated section before the refresh cycle completes). For example, the predetermined threshold may be 7 ms for a memory chip with an 8 ms refresh period, so that if a section has not been accessed within 7 ms, the refresh controller will send a refresh or access signal that reaches the section by the end of the 8 ms refresh period. In some embodiments, the predetermined threshold may depend on the size of the associated section. For example, for smaller sections of memory chip 2800, the predetermined threshold may be smaller.
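
A small sketch of this timestamp comparison, assuming millisecond units and hypothetical names, is shown below.

# Sketch: skip refreshing a section whose stored timestamp shows a recent
# access; otherwise issue a refresh. Times are in milliseconds.

REFRESH_PERIOD_MS = 8
THRESHOLD_MS = 7          # shorter than the period, as discussed above

def sections_to_refresh(last_access_ms, now_ms):
    """last_access_ms: dict mapping section id -> timestamp of last access."""
    return [section for section, t in last_access_ms.items()
            if now_ms - t >= THRESHOLD_MS]    # stale sections need a refresh

timestamps = {"row0": 1.0, "row1": 6.5}       # row1 was accessed recently
print(sections_to_refresh(timestamps, now_ms=8.0))   # ['row0']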

Although described above with respect to memory chips, the refresh controllers of the present disclosure may also be used in distributed processor architectures, such as those described above and throughout the sections of the present disclosure. One example of such an architecture is depicted in FIG. 7A. In these embodiments, the same substrate as memory chip 2800 may include a plurality of processing groups disposed thereon, for example, as depicted in FIG. 7A. As explained above with respect to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on the substrate. The group may represent a spatial distribution and/or a logical grouping on the substrate for the purpose of compiling code for execution on memory chip 2800. Accordingly, the substrate may include a memory array that includes a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. In addition, the substrate may include a processing array that may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

As further explained above with respect to FIG. 7A, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Furthermore, to allow each processor subunit to communicate with its corresponding dedicated memory bank(s), the substrate may include a first plurality of buses, each connecting one of the processor subunits to its corresponding dedicated memory bank(s).

In these embodiments, as shown in FIG. 7A, the substrate may include a second plurality of buses connecting each processor subunit to at least one other processor subunit (for example, an adjacent subunit in the same row, an adjacent processor subunit in the same column, or any other processor subunit on the substrate). The first plurality of buses and/or the second plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by timing hardware logic components, as explained above in the "Synchronization Using Software" section.

In embodiments where the same substrate as memory chip 2800 includes a plurality of processing groups disposed thereon (for example, as depicted in FIG. 7A), the processor subunits may also include address generators (for example, address generator 450 as depicted in FIG. 4). Furthermore, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Accordingly, each of the address generators may be associated with a corresponding dedicated memory bank of the plurality of memory banks. In addition, the substrate may include a plurality of buses, each connecting one of the plurality of address generators to its corresponding dedicated memory bank.

FIG. 29A depicts an example refresh controller 2900 consistent with the present disclosure. Refresh controller 2900 may be incorporated into a memory chip of the present disclosure (such as memory chip 2800 of FIG. 28). As depicted in FIG. 29A, refresh controller 2900 may include a timer 2901, which may comprise an on-chip oscillator or any other timing circuit for refresh controller 2900. In the configuration depicted in FIG. 29A, timer 2901 may trigger a refresh cycle periodically (for example, every 8 ms, 16 ms, 32 ms, 64 ms, or the like). A refresh cycle may use a row counter 2903 to cycle through all the rows of the corresponding memory chip, and use an adder 2907 in conjunction with a valid bit 2905 to generate a refresh signal for each row. As shown in FIG. 29A, bit 2905 may be fixed at 1 ("true") to ensure that every row is refreshed during a cycle.

In embodiments of the present disclosure, refresh controller 2900 may include a data store. As described above, the data store may include one or more registers or non-volatile memories associated with sections of memory chip 2800 (for example, lines, columns, or any other groupings of memory cells within memory chip 2800). The register or non-volatile memory may be configured to store a timestamp or other information indicating the most recent access to the associated section.

Refresh controller 2900 may use the stored information to skip refreshes of sections of memory chip 2800. For example, if the information indicates that a section was refreshed during one or more previous refresh cycles, refresh controller 2900 may skip that section in the current refresh cycle. In another example, if the difference between the timestamp stored for a section and the current time is below a threshold, refresh controller 2900 may skip that section in the current refresh cycle. Refresh controller 2900 may further continue to track accesses to, and refreshes of, sections of memory chip 2800 over multiple refresh cycles. For example, refresh controller 2900 may use timer 2901 to update the stored timestamps. In these embodiments, refresh controller 2900 may be configured to use the output of the timer to clear access information stored in the data store after a threshold time interval. For example, in embodiments where the data store stores the timestamp of the most recent access or refresh of the associated section, refresh controller 2900 may store a new timestamp in the data store whenever an access command or refresh signal is sent to that section. If the data store stores bits rather than timestamps, timer 2901 may be configured to clear bits that have been set for longer than a threshold period of time. For example, in embodiments where the data store stores an indication that the associated section was accessed during one or more previous cycles, refresh controller 2900 may clear a bit in the data store (for example, set it to 0) whenever timer 2901 triggers a new refresh cycle that occurs a critical number of cycles (for example, one, two, or the like) after the associated bit was set (for example, set to 1).
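
The bit-based variant of this behavior can be sketched as follows; the RefreshSketch class layout is an assumption made for illustration and does not correspond to the circuit of FIG. 29A.

# Sketch: at the end of a refresh cycle, iterate over all rows, skip rows
# whose access bit is set, refresh the rest, and clear the bits afterwards.

class RefreshSketch:
    def __init__(self, num_rows):
        self.accessed = [False] * num_rows   # one access bit per row

    def note_access(self, row):
        self.accessed[row] = True            # set by the read/write path

    def refresh_cycle(self, refresh_row):
        for row, was_accessed in enumerate(self.accessed):
            if was_accessed:
                self.accessed[row] = False   # skip the refresh, clear the bit
            else:
                refresh_row(row)             # row was idle: refresh it

r = RefreshSketch(num_rows=4)
r.note_access(1)
r.note_access(3)
r.refresh_cycle(refresh_row=lambda row: print("refresh row", row))
# prints: refresh row 0, refresh row 2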

Refresh controller 2900 may track accesses to segments of memory chip 2800 in cooperation with other hardware of memory chip 2800. For example, memory chips use sense amplifiers to perform read operations (e.g., as shown above in FIGS. 9 and 10). A sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of memory chip 2800 that stores data in one or more memory cells, and to amplify the small voltage swing to a higher voltage level such that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit as explained above. Although not depicted in FIG. 29A, refresh controller 2900 may further communicate with a sense amplifier configured to access one or more segments and change the state of at least one bit register. For example, when the sense amplifier accesses one or more segments, it may set a bit associated with each such segment (e.g., to 1), indicating that the associated segment was accessed in the previous cycle. In embodiments in which the data store stores a timestamp of the most recent access or refresh of the associated segment, when the sense amplifier accesses one or more segments, it may trigger writing of a timestamp from timer 2901 to the register, memory, or other element comprising the data store.

In any of the embodiments described above, refresh controller 2900 may be integrated with a memory controller for a plurality of memory banks. For example, similar to the embodiment depicted in FIG. 3A, refresh controller 2900 may be incorporated into the logic and control subunits associated with the memory banks or other segments of memory chip 2800.

FIG. 29B depicts another example refresh controller 2900' consistent with the present disclosure. Refresh controller 2900' may be incorporated into a memory chip of the present disclosure, such as memory chip 2800 of FIG. 28. Similar to refresh controller 2900, refresh controller 2900' includes timer 2901, row counter 2903, valid bit 2905, and adder 2907. In addition, refresh controller 2900' may include a data store 2909. As shown in FIG. 29B, data store 2909 may comprise one or more registers or non-volatile memories associated with segments of memory chip 2800 (e.g., lines, columns, or any other grouping of memory cells within memory chip 2800), and a state within the data store may be configured to change in response to one or more of the segments being accessed (e.g., by a sense amplifier and/or other elements of refresh controller 2900', as described above). Accordingly, refresh controller 2900' may be configured to skip refreshes of one or more segments based on the state within the data store. For example, if the state associated with a segment is activated (e.g., set to 1 by being switched on, by changing a property so as to store a "1," or the like), refresh controller 2900' may skip the refresh cycle for the associated segment and clear the state associated with that portion. The state may be stored in at least a one-bit register or any other memory structure configured to store at least one bit of data.

To ensure that segments of the memory chip are refreshed or accessed during each refresh cycle, refresh controller 2900' may reset or otherwise clear a state in order to trigger a refresh signal during the next refresh cycle. In some embodiments, after a segment is skipped, refresh controller 2900' may clear the associated state to ensure that the segment is refreshed in the next refresh cycle. In other embodiments, refresh controller 2900' may be configured to reset a state within the data store after a threshold time interval. For example, whenever timer 2901 exceeds a threshold time since the associated state was set (e.g., set to 1 by being switched on, by changing a property so as to store a "1," or the like), refresh controller 2900' may clear the state in the data store (e.g., set it to 0). In some embodiments, refresh controller 2900' may use a threshold number of refresh cycles (e.g., one, two, or the like) or a threshold number of clock cycles (e.g., two, four, or the like) rather than a threshold time.
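
The bit-based variant can be sketched in the same spirit. In the hypothetical model below (all names are illustrative), an access sets the segment's bit; during a refresh cycle a set bit causes the segment to be skipped and is then cleared, so a segment cannot be skipped twice in a row without an intervening access.

```python
class BitStateRefreshModel:
    """Illustrative model of the one-bit-per-segment scheme of FIG. 29B."""

    def __init__(self, num_segments: int):
        self.accessed = [0] * num_segments   # data store 2909: one bit per segment

    def on_access(self, segment: int) -> None:
        self.accessed[segment] = 1           # set via the sense-amplifier path

    def refresh_cycle(self) -> None:
        for seg in range(len(self.accessed)):
            if self.accessed[seg]:
                # Segment was accessed since the last cycle: skip its refresh,
                # then clear the state so it will be refreshed in the next cycle.
                self.accessed[seg] = 0
            else:
                self.refresh(seg)

    def refresh(self, segment: int) -> None:
        pass  # placeholder for driving the refresh signal of a segment
```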

In other embodiments, the state may comprise a timestamp of the most recent refresh or access of the associated segment, such that if the amount of time between the timestamp and the current time (e.g., from timer 2901 of FIGS. 29A and 29B) exceeds a predetermined threshold (e.g., 8 ms, 16 ms, 32 ms, 64 ms, or the like), refresh controller 2900' may send an access command or a refresh signal to the associated segment and update the timestamp associated with that portion (e.g., using timer 2901). Additionally or alternatively, refresh controller 2900' may be configured to skip a refresh operation with respect to one or more segments in the plurality of memory banks if a refresh time indicator indicates that the last refresh time is within a predetermined time threshold. In these embodiments, after skipping a refresh operation with respect to one or more segments, refresh controller 2900' may be configured to alter the stored refresh time indicators associated with the one or more segments such that the one or more segments will be refreshed during the next operation cycle. For example, as described above, refresh controller 2900' may use timer 2901 to update the stored refresh time indicators.

Accordingly, the data store may include a timestamp register configured to store a refresh time indicator indicating when one or more segments in the plurality of memory banks were last refreshed. In addition, refresh controller 2900' may use the output of the timer to clear access information stored in the data store after a threshold time interval.

In any of the embodiments described above, an access to one or more segments may comprise a write operation associated with the one or more segments. Additionally or alternatively, an access to one or more segments may comprise a read operation associated with the one or more segments.

Furthermore, as depicted in FIG. 29B, refresh controller 2900' may include row counter 2903 and adder 2907 configured to assist in updating data store 2909 based, at least in part, on the state within the data store. Data store 2909 may comprise a bit table associated with the plurality of memory banks. For example, the bit table may comprise an array of switches (or switching elements, such as transistors) or registers (e.g., SRAM or the like) configured to hold a bit for the associated segment. Additionally or alternatively, data store 2909 may store timestamps associated with the plurality of memory banks.

Furthermore, refresh controller 2900' may include a refresh gate 2911 configured to control, based on a corresponding value stored in the bit table, whether a refresh of one or more segments is performed. For example, refresh gate 2911 may comprise a logic gate (such as an AND gate) that invalidates the refresh signal from row counter 2903 if the corresponding state in data store 2909 indicates that the associated segment was refreshed or accessed during one or more previous clock cycles. In other embodiments, refresh gate 2911 may comprise a microprocessor or other circuit configured to invalidate the refresh signal from row counter 2903 if the corresponding timestamp from data store 2909 indicates that the associated segment was refreshed or accessed within a predetermined threshold time value.
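
A one-line way to express the gate's behavior, assuming an active-high refresh signal and an active-high "recently accessed" bit, is sketched below; this is an interpretation of the AND-gate arrangement for illustration, not the exact circuit.

```python
def gated_refresh(row_refresh_signal: bool, recently_accessed: bool) -> bool:
    """Refresh gate 2911 (illustrative): pass the row counter's refresh signal
    only when the segment was NOT refreshed or accessed in recent cycles."""
    return row_refresh_signal and not recently_accessed
```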

FIG. 30 is a flowchart of an example process 3000 for partial refresh in a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3000 may be executed by a refresh controller consistent with the present disclosure, such as refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B.

At step 3010, the refresh controller may access information indicative of access operations for one or more segments in a plurality of memory banks. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may include a data store that is associated with segments of memory chip 2800 (e.g., lines, columns, or any other grouping of memory cells within memory chip 2800) and that is configured to store a timestamp or other information indicating the most recent access of the associated segment.

At step 3020, the refresh controller may generate refresh and/or access commands based, at least in part, on the accessed information. For example, as explained above with respect to FIGS. 29A and 29B, if the accessed information indicates that the last refresh or access time is within a predetermined time threshold, and/or if the accessed information indicates that the last refresh or access occurred during one or more previous clock cycles, the refresh controller may skip a refresh operation with respect to one or more segments in the plurality of memory banks. Additionally or alternatively, the refresh controller may generate a command to refresh or access the associated segment based on whether the accessed information indicates that the last refresh or access time exceeds the predetermined threshold and/or whether the accessed information indicates that the last refresh or access did not occur during one or more previous clock cycles.

At step 3030, the refresh controller may alter the stored refresh time indicators associated with one or more segments such that the one or more segments will be refreshed during the next operation cycle. For example, after skipping a refresh operation with respect to one or more segments, the refresh controller may alter the information indicative of access operations for the one or more segments such that the one or more segments will be refreshed during the next clock cycle. Thus, after skipping a refresh cycle, the refresh controller may clear the state of the segment (e.g., set it to 0). Additionally or alternatively, the refresh controller may set the state (e.g., to 1) of segments that were refreshed and/or accessed during the current cycle. In embodiments in which the information indicative of access operations for one or more segments comprises timestamps, the refresh controller may update any stored timestamps associated with segments that were refreshed and/or accessed during the current cycle.
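
Putting steps 3010 through 3030 together, a compact Python sketch of one pass of process 3000 might look as follows; the data-store layout and helper names are assumptions made for illustration.

```python
def process_3000_pass(data_store: dict, threshold: float, now: float) -> None:
    """One illustrative pass of process 3000 over all segments.

    data_store maps each segment to the time of its last access or refresh."""
    for segment, last_access in data_store.items():
        # Step 3010: access information indicating the segment's last access.
        recently_used = (now - last_access) < threshold
        if recently_used:
            # Step 3020: skip the refresh of this segment in this cycle.
            # Step 3030: age the stored indicator so the segment will be
            # refreshed during the next operation cycle.
            data_store[segment] = now - threshold
        else:
            # Step 3020: generate the refresh command and record the time.
            issue_refresh(segment)
            data_store[segment] = now


def issue_refresh(segment) -> None:
    pass  # placeholder for sending a refresh signal to the segment
```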

Method 3000 may also include additional steps. For example, in addition to or as an alternative to step 3030, a sense amplifier may access one or more segments and may change information associated with the one or more segments. Additionally or alternatively, the sense amplifier may signal the refresh controller when an access has occurred so that the refresh controller may update the information associated with the one or more segments. As explained above, the sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of the memory chip that stores data in one or more memory cells, and to amplify the small voltage swing to a higher voltage level such that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit as explained above. In this embodiment, whenever the sense amplifier accesses one or more segments, it may set (e.g., to 1) a bit associated with each such segment, indicating that the associated segment was accessed in the previous cycle. In embodiments in which the information indicative of access operations for one or more segments comprises timestamps, whenever the sense amplifier accesses one or more segments, it may trigger writing of a timestamp from the refresh controller's timer to the data store in order to update any stored timestamps associated with the segment.

FIG. 31 is a flowchart of an example process 3100 for determining refreshes of a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3100 may be implemented within a compiler consistent with the present disclosure. As explained above, a "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language such as C, FORTRAN, BASIC, or the like; an object-oriented language such as Java, C++, Pascal, Python, or the like; etc.) into a lower-level language (e.g., assembly code, object code, machine code, or the like). A compiler may allow a human to program a series of instructions in a human-readable language, which is then converted into a machine-executable language. A compiler may comprise software instructions executed by one or more processors.

At step 3110, the one or more processors may receive higher-level computer code. For example, the higher-level computer code may be encoded in one or more files on a memory (e.g., a non-volatile memory such as a hard disk drive or the like, a volatile memory such as a DRAM, or the like) or received via a network (e.g., the Internet or the like). Additionally or alternatively, the higher-level computer code may be received from a user (e.g., using an input device such as a keyboard).

At step 3120, the one or more processors may identify a plurality of memory segments, distributed across a plurality of memory banks associated with a memory chip, that are to be accessed by the higher-level computer code. For example, the one or more processors may access a data structure defining the plurality of memory banks of the memory chip and a corresponding structure. The one or more processors may access the data structure from a memory (e.g., a non-volatile memory such as a hard disk drive or the like, a volatile memory such as a DRAM, or the like) or receive the data structure via a network (e.g., the Internet or the like). In these embodiments, the data structure is included in one or more libraries accessible by the compiler to permit the compiler to generate instructions for the particular memory chip to be accessed.

At step 3130, the one or more processors may evaluate the higher-level computer code to identify a plurality of memory read commands occurring within a plurality of memory access cycles. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands reading from memory and/or one or more write commands writing to memory. Such instructions may include variable initializations, variable reassignments, logical operations on variables, input/output operations, or the like.

At step 3140, the one or more processors may cause data associated with the plurality of memory access commands to be distributed across each of the plurality of memory segments such that each of the plurality of memory segments is accessed during each of the plurality of memory access cycles. For example, the one or more processors may identify the memory segments from the data structure defining the structure of the memory chip and then assign variables from the higher-level code to each of the memory segments such that each memory segment is accessed (e.g., via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In this embodiment, the one or more processors may access information indicating how many clock cycles each line of the higher-level code requires, in order to assign variables from the lines of higher-level code such that each memory segment is accessed (e.g., via a write or a read) at least once during the particular number of clock cycles.

In another embodiment, the one or more processors may first generate machine code or other lower-level code from the higher-level code. The one or more processors may then assign variables from the lower-level code to each of the memory segments such that each memory segment is accessed (e.g., via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In this embodiment, each line of the lower-level code may require a single clock cycle.
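
A highly simplified way to picture step 3140 is a round-robin assignment of program variables to memory segments, so that accesses spread across every segment within a refresh window. The sketch below is an assumption-laden illustration (the idea of a flat variable list and named segments is a placeholder), not the disclosed compiler.

```python
from itertools import cycle


def assign_variables(variables: list, segments: list) -> dict:
    """Round-robin assignment so accesses are spread across all memory segments."""
    placement = {}
    seg_iter = cycle(segments)
    for var in variables:
        placement[var] = next(seg_iter)
    return placement


# Example (hypothetical): four segments, six variables from the lower-level code.
segments = ["seg0", "seg1", "seg2", "seg3"]
variables = ["a", "b", "c", "tmp0", "tmp1", "result"]
print(assign_variables(variables, segments))
# {'a': 'seg0', 'b': 'seg1', 'c': 'seg2', 'tmp0': 'seg3', 'tmp1': 'seg0', 'result': 'seg1'}
```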

In any of the embodiments given above, the one or more processors may further assign logical operations or other commands that use temporary outputs to each of the memory segments. These temporary outputs may also generate read and/or write commands, such that an assigned memory segment is still accessed during the refresh cycle even if no named variable has been assigned to that memory segment.

Method 3100 may also include additional steps. For example, in embodiments in which variables are assigned before compilation, the one or more processors may generate machine code or other lower-level code from the higher-level code. In addition, the one or more processors may transmit the compiled code for execution by the memory chip and corresponding logic circuits. The logic circuits may comprise conventional circuits such as a GPU or a CPU, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A. Thus, as described above, the substrate may include a memory array comprising a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. In addition, the substrate may include a processing array that may comprise a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

FIG. 32 is a flowchart of another example process 3200 for determining refreshes of a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3200 may be implemented within a compiler consistent with the present disclosure. Process 3200 may be executed by one or more processors executing software instructions comprising the compiler. Process 3200 may be implemented separately from, or in combination with, process 3100 of FIG. 31.

At step 3210, similar to step 3110, the one or more processors may receive higher-level computer code. At step 3220, similar to step 3120, the one or more processors may identify a plurality of memory segments, distributed across a plurality of memory banks associated with a memory chip, that are to be accessed by the higher-level computer code.

At step 3230, the one or more processors may evaluate the higher-level computer code to identify a plurality of memory read commands, each involving one or more of the plurality of memory segments. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands reading from memory and/or one or more write commands writing to memory. Such instructions may include variable initializations, variable reassignments, logical operations on variables, input/output operations, or the like.

In some embodiments, the one or more processors may simulate execution of the higher-level code using the logic circuits and the plurality of memory segments. For example, the simulation may comprise a line-by-line step-through of the higher-level code, similar to that of a debugger or other instruction set simulator (ISS). The simulation may further maintain internal variables representing the addresses of the plurality of memory segments, similar to how a debugger may maintain internal variables representing the registers of a processor.

At step 3240, the one or more processors may, based on the analysis of the memory access commands and for each memory segment of the plurality of memory segments, track the amount of time accumulated since the last access to that memory segment. For example, using the simulation described above, the one or more processors may determine the length of time between each access (e.g., a read or a write) to one or more addresses within each of the plurality of memory segments. The length of time may be measured in absolute time, clock cycles, or refresh cycles (e.g., as determined by the known refresh rate of the memory chip).

At step 3250, in response to a determination that the amount of time elapsed since the last access of any particular memory segment will exceed a predetermined threshold, the one or more processors may introduce into the higher-level computer code at least one of a memory refresh command or a memory access command configured to cause an access of the particular memory segment. For example, the one or more processors may include a refresh command for execution by a refresh controller (e.g., refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B). In embodiments in which the logic circuits are not embedded on the same substrate as the memory chip, the one or more processors may generate refresh commands to be sent to the memory chip separately from the lower-level code to be sent to the logic circuits.

Additionally or alternatively, the one or more processors may include an access command for execution by a memory controller (which may be separate from, or incorporated into, the refresh controller). The access command may comprise a dummy command configured to trigger a read operation on the memory segment without causing the logic circuits to perform any other operation on the variables read from or written to the memory segment.
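
To make steps 3240 and 3250 concrete, the following hedged sketch walks a (simulated) sequence of per-cycle segment accesses, counts how long each segment has gone untouched, and splices in a refresh or dummy-access command once a threshold is exceeded. The trace format and the threshold value are invented for illustration.

```python
THRESHOLD_CYCLES = 4   # assumed maximum number of cycles between accesses


def insert_refresh_commands(trace, num_segments):
    """trace: list of sets, one per cycle, of segment indices accessed that cycle.
    Returns (cycle, segment) pairs where a refresh or dummy access is inserted."""
    last_access = {seg: 0 for seg in range(num_segments)}
    inserted = []
    for cycle_idx, accessed in enumerate(trace):
        for seg in accessed:
            last_access[seg] = cycle_idx
        for seg in range(num_segments):
            if cycle_idx - last_access[seg] >= THRESHOLD_CYCLES:
                inserted.append((cycle_idx, seg))   # e.g., REFRESH seg or a dummy read
                last_access[seg] = cycle_idx
    return inserted


# Hypothetical trace: segment 2 is never touched by the program itself.
trace = [{0, 1}, {0}, {1}, {0, 1}, {0}, {1}, {0}]
print(insert_refresh_commands(trace, num_segments=3))
# [(4, 2)] -> a refresh or dummy access of segment 2 is inserted at cycle 4
```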

In some embodiments, the compiler may include a combination of steps from process 3100 and steps from process 3200. For example, the compiler may assign variables according to step 3140 and then run the simulation described above according to step 3250 to add any additional memory refresh commands or memory access commands. This combination may allow the compiler to distribute variables across as many memory segments as possible and to generate refresh or access commands for any memory segment that cannot be accessed within the predetermined threshold amount of time. In another combined embodiment, the compiler may simulate the code according to step 3230 and assign variables according to step 3140 based on the simulation indicating any memory segments that would not otherwise be accessed within the predetermined threshold amount of time. In some embodiments, this combination may also include step 3250 to allow the compiler to generate refresh or access commands for any memory segment that cannot be accessed within the predetermined threshold amount of time, even after the assignment according to step 3140 is complete.

The refresh controller of the present disclosure may allow software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) to disable the automatic refresh performed by the refresh controller and instead control refreshes via the executed software. Accordingly, some embodiments of the present disclosure may provide software with a known access pattern to the memory chip (e.g., if the compiler has access to a data structure defining the plurality of memory banks of the memory chip and a corresponding structure). In these embodiments, a post-compile optimizer may disable automatic refresh and manually set refresh controls only for segments of the memory chip that are not accessed within a threshold amount of time. Thus, similar to step 3250 described above but performed after compilation, the post-compile optimizer may generate refresh commands to ensure that each memory segment is accessed or refreshed within the predetermined threshold amount of time.

Another embodiment for reducing refresh cycles may include using predefined patterns of accesses to the memory chip. For example, if the software executed by the logic circuits can control its access pattern for the memory chip, some embodiments may generate access patterns for refreshes that go beyond a conventional linear row-by-row refresh. For example, if the controller determines that the software executed by the logic circuits regularly accesses every other row of memory, the refresh controller of the present disclosure may use an access pattern that does not refresh every other row, in order to speed up the memory chip and reduce power usage.

An example of such a refresh controller is shown in FIG. 33. FIG. 33 depicts an example refresh controller 3300, configured by stored patterns, consistent with the present disclosure. Refresh controller 3300 may be incorporated into a memory chip of the present disclosure having, for example, a plurality of memory banks and a plurality of memory segments included in each of the plurality of memory banks, such as memory chip 2800 of FIG. 28.

Refresh controller 3300 includes a timer 3301 (similar to timer 2901 of FIGS. 29A and 29B), a row counter 3303 (similar to row counter 2903 of FIGS. 29A and 29B), and an adder 3305 (similar to adder 2907 of FIGS. 29A and 29B). In addition, refresh controller 3300 includes a data store 3307. Unlike data store 2909 of FIG. 29B, data store 3307 may store at least one memory refresh pattern to be implemented to refresh the plurality of memory segments included in each of the plurality of memory banks. For example, as depicted in FIG. 33, data store 3307 may include Li values (e.g., L1, L2, L3, and L4 in the embodiment of FIG. 33) and Hi values (e.g., H1, H2, H3, and H4 in the embodiment of FIG. 33) that define, in rows and/or columns, the segments of a memory bank. In addition, each segment may be associated with an Inci variable (e.g., Inc1, Inc2, Inc3, and Inc4 in the embodiment of FIG. 33) that defines how the rows associated with the segment are incremented (e.g., whether every row is accessed or refreshed, whether every other row is accessed or refreshed, or the like). Thus, as shown in FIG. 33, the refresh pattern may comprise a table including a plurality of memory segment identifiers, assigned by software, that identify the ranges of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and the ranges of memory segments in that particular memory bank that are not to be refreshed during the refresh cycle.
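
One plausible software representation of such a pattern table is sketched below: each entry carries a low row (Li), a high row (Hi), and an increment (Inci) that together describe which rows of a segment range the controller steps through. The field names, the row counts, and the use of an increment of 0 to mean "skip" are illustrative assumptions.

```python
# Illustrative encoding of the refresh-pattern table of FIG. 33.
refresh_pattern = [
    # one entry per segment range: low row (L), high row (H), increment (Inc)
    {"L": 0,    "H": 1023, "Inc": 1},   # refresh every row in rows 0-1023
    {"L": 1024, "H": 2047, "Inc": 2},   # refresh every other row in rows 1024-2047
    {"L": 2048, "H": 3071, "Inc": 0},   # Inc == 0 taken here to mean "skip this range"
    {"L": 3072, "H": 4095, "Inc": 1},
]


def rows_to_refresh(pattern):
    """Expand a pattern into the list of rows refreshed in one refresh cycle."""
    rows = []
    for entry in pattern:
        if entry["Inc"] == 0:
            continue   # range is skipped (it is accessed by the software instead)
        rows.extend(range(entry["L"], entry["H"] + 1, entry["Inc"]))
    return rows
```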

Thus, data store 3307 may define refresh patterns that software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) may select for use. The memory refresh pattern may be software-configurable to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle and which of the plurality of memory segments in the particular memory bank are not to be refreshed during that refresh cycle. Accordingly, refresh controller 3300 may, according to Inci, refresh some or all rows within the defined segments that are not accessed during the current cycle. Refresh controller 3300 may skip the other rows of the defined segments that are set to be accessed during the current cycle.

In embodiments in which data store 3307 of refresh controller 3300 includes a plurality of memory refresh patterns, each memory refresh pattern may represent a different refresh pattern for refreshing the plurality of memory segments included in each of the plurality of memory banks. A memory refresh pattern may be selectable for use on the plurality of memory segments. Accordingly, refresh controller 3300 may be configured to allow selection of which of the plurality of memory refresh patterns is implemented during a particular refresh cycle. For example, software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) may select different memory refresh patterns for use during one or more different refresh cycles. Alternatively, the software executed by the logic circuits may select one memory refresh pattern for use throughout some or all of the different refresh cycles.

The memory refresh pattern may be encoded using one or more variables stored in data store 3307. For example, in embodiments in which the plurality of memory segments are arranged in rows, each memory segment identifier may be configured to identify a particular location within a row of the memory at which a memory refresh should begin or end. For example, in addition to Li and Hi, one or more additional variables may define which portions of the rows defined by Li and Hi lie within a segment.

FIG. 34 is a flowchart of an example process 3400 for determining refreshes of a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3400 may be implemented by software within a refresh controller consistent with the present disclosure (e.g., refresh controller 3300 of FIG. 33).

At step 3410, the refresh controller may store at least one memory refresh pattern to be implemented to refresh the plurality of memory segments included in each of the plurality of memory banks. For example, as explained above with respect to FIG. 33, the refresh pattern may comprise a table including a plurality of memory segment identifiers, assigned by software, that identify the ranges of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and the ranges of memory segments in that particular memory bank that are not to be refreshed during the refresh cycle.

In some embodiments, the at least one refresh pattern may be encoded onto the refresh controller during manufacturing (e.g., onto a read-only memory associated with, or at least accessible by, the refresh controller). Accordingly, the refresh controller may access the at least one memory refresh pattern rather than store it.

At steps 3420 and 3430, the refresh controller may use software to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle and which of the plurality of memory segments in the particular memory bank are not to be refreshed during that refresh cycle. For example, as explained above with respect to FIG. 33, software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) may select at least one memory refresh pattern. In addition, the refresh controller may access the selected at least one memory refresh pattern to generate corresponding refresh signals during each refresh cycle. The refresh controller may refresh, according to the at least one memory refresh pattern, some or all portions of the defined segments that are not accessed during the current cycle, and may skip the other portions of the defined segments that are set to be accessed during the current cycle.

At step 3440, the refresh controller may generate corresponding refresh commands. For example, as depicted in FIG. 33, adder 3305 may comprise a logic circuit configured to invalidate the refresh signals for particular segments that are not to be refreshed according to the at least one memory refresh pattern in data store 3307. Additionally or alternatively, a microprocessor (not shown in FIG. 33) may generate particular refresh signals based on which segments are to be refreshed according to the at least one memory refresh pattern in data store 3307.
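
Under the same illustrative encoding used in the sketch above, the decision made for each row as the row counter advances might be modeled as follows; this is an interpretation of the gating behavior attributed to adder 3305, not the actual circuit.

```python
def refresh_enabled(row: int, pattern) -> bool:
    """Return True if the selected pattern calls for refreshing this row."""
    for entry in pattern:
        if entry["L"] <= row <= entry["H"]:
            if entry["Inc"] == 0:
                return False                              # range skipped entirely
            return (row - entry["L"]) % entry["Inc"] == 0  # e.g., every other row
    return True   # rows not covered by any entry fall back to a normal refresh
```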

Method 3400 may also include additional steps. For example, in embodiments in which the at least one memory refresh pattern is configured to change every one, two, or another number of refresh cycles (e.g., moving from L1, H1, and Inc1 to L2, H2, and Inc2, as shown in FIG. 33), the refresh controller may access a different portion of the data store for the next determination of refresh signals according to steps 3430 and 3440. Similarly, if the software executed by the logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) selects a new memory refresh pattern from the data store for use in one or more future refresh cycles, the refresh controller may access a different portion of the data store for the next determination of refresh signals according to steps 3430 and 3440.

When a memory chip is designed with a certain memory capacity as a target, changing the memory capacity to a larger or smaller size may require redesigning the product and redesigning the entire mask set. Often, product design proceeds in parallel with market research, and in some cases product design is completed before market research becomes available. Therefore, there may be a disconnect between the product design and the actual needs of the market. The present disclosure proposes a way to flexibly provide memory chips with memory capacities that meet market demands. The design approach may include designing dies on a wafer, together with appropriate interconnect circuitry, such that memory chips that may contain one or more dies can be selectively diced from the wafer, thereby providing the opportunity to produce memory chips with variable memory capacities from a single wafer.

The present disclosure relates to systems and methods for fabricating memory chips by dicing the memory chips from a wafer. The method may be used to produce memory chips of selectable size from a wafer. An example of a wafer 3501 containing dies 3503 is shown in FIG. 35A. Wafer 3501 may be formed from a semiconductor material (e.g., silicon (Si), silicon germanium (SiGe), silicon-on-insulator (SOI), gallium nitride (GaN), aluminum nitride (AlN), aluminum gallium nitride (AlGaN), boron nitride (BN), gallium arsenide (GaAs), aluminum gallium arsenide (AlGaAs), indium nitride (InN), combinations of the above, and the like). Dies 3503 may include any suitable circuit elements (e.g., transistors, capacitors, resistors, and/or the like), which may include any suitable semiconductor, dielectric, or metal components. Dies 3503 may be formed from a semiconductor material that may be the same as, or different from, the material of wafer 3501. In addition to dies 3503, wafer 3501 may also include other structures and/or circuitry. In some embodiments, one or more coupling circuits may be provided that couple one or more of the dies together. In one example, such a coupling circuit may include a bus shared by two or more dies 3503. In addition, the coupling circuit may include one or more logic circuits designed to control circuitry associated with dies 3503 and/or to direct information to and from dies 3503. In some cases, the coupling circuit may include memory access management logic. Such logic may translate logical memory addresses into physical addresses associated with dies 3503. It should be noted that, as used herein, the term fabrication may refer collectively to any of the steps used to build the disclosed wafers, dies, and/or chips. For example, fabrication may refer to the simultaneous layout and formation of the various dies (and any other circuitry) included on a wafer. Fabrication may also refer to dicing memory chips of selectable size from the wafer, to include one die in some cases or multiple dies in other cases. Of course, the term fabrication is not intended to be limited to these examples, but may include other aspects associated with the production of any or all of the disclosed memory chips and intermediate structures.

Die 3503, or a group of dies, may be used to fabricate a memory chip. The memory chip may include distributed processors, as described in other sections of this disclosure. As shown in FIG. 35B, die 3503 may include a substrate 3507 and a memory array disposed on the substrate. The memory array may include one or more memory units, such as memory banks 3511A-3511D, designed to store data. In various embodiments, the memory banks may include semiconductor-based circuit elements such as transistors, capacitors, and the like. In one example, a memory bank may include multiple rows and multiple columns of storage cells. In some cases, such a memory bank may have a capacity greater than one megabyte. The memory bank may include dynamic or static access memory.

Die 3503 may also include a processing array disposed on the substrate, the processing array including a plurality of processor subunits 3515A-3515D, as shown in FIG. 35B. As described above, each memory bank may include a dedicated processor subunit connected by a dedicated bus. For example, processor subunit 3515A is associated with memory bank 3511A via a bus or connection 3512. It should be understood that various connections between memory banks 3511A-3511D and processor subunits 3515A-3515D are possible, and only some illustrative connections are shown in FIG. 35B. In one example, a processor subunit may perform read/write operations on the associated memory bank and may further perform refresh operations, or any other suitable operations, with respect to the memory stored in the various memory banks.

As mentioned, die 3503 may include a first group of buses configured to connect the processor subunits with their corresponding memory banks. An example bus may include a set of wires or conductors that connect electrical components and allow data and addresses to be transferred to and from each memory bank and its associated processor subunit. In one embodiment, connection 3512 may serve as a dedicated bus for connecting processor subunit 3515A to memory bank 3511A. Die 3503 may include a group of such buses, each connecting a processor subunit to a corresponding dedicated memory bank. In addition, die 3503 may include another group of buses, each connecting the processor subunits (e.g., subunits 3515A-3515D) to one another. For example, such buses may include connections 3516A-3516D. In various embodiments, data for memory banks 3511A-3511D may be delivered via an input-output bus 3530. In one embodiment, input-output bus 3530 may carry data-related information as well as command-related information for controlling the operation of the memory units of die 3503. The data information may include data to be stored in the memory banks, data read from the memory banks, processing results from one or more of the processor subunits based on operations performed with respect to data stored in the corresponding memory banks, command-related information, various codes, and so forth.

In various cases, the data and commands transmitted over input-output bus 3530 may be controlled by an input-output (IO) controller 3521. In one embodiment, IO controller 3521 may control the flow of data from bus 3530 to and from processor subunits 3515A-3515D. IO controller 3521 may determine from which of processor subunits 3515A-3515D information is retrieved. In various embodiments, IO controller 3521 may include a fuse 3554 configured to deactivate IO controller 3521. Fuse 3554 may be used if multiple dies are combined together to form a larger memory chip (also referred to as a multi-die memory chip, as an alternative to a single-die memory chip containing only one die). The multi-die memory chip may then use the IO controller of one of the die units forming the multi-die memory chip, while deactivating the other IO controllers by using the fuses corresponding to the other die units.

As mentioned, each memory chip, or each die or group of dies, may include distributed processors associated with corresponding memory banks. In some embodiments, these distributed processors may be arranged in a processing array disposed on the same substrate as the plurality of memory banks. In addition, the processing array may include one or more logic portions, each including an address generator (also referred to as an address generation unit (AGU)). In some cases, the address generator may be part of at least one processor subunit. The address generator may generate the memory addresses required to fetch data from the one or more memory banks associated with the memory chip. Address generation computations may involve integer arithmetic operations, such as addition, subtraction, modulo operations, or bit shifts. The address generator may be configured to operate on multiple operands at a time. Furthermore, multiple address generators may perform more than one address computation operation simultaneously. In various embodiments, an address generator may be associated with a corresponding memory bank. The address generators may be connected to their corresponding memory banks by means of corresponding bus lines.
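
As a simple illustration of the kind of integer arithmetic an AGU may perform, the sketch below computes a physical address from a bank base, a row, and a column using addition, shifts, and modulo operations; the field widths and address layout are assumptions for illustration, not the layout used by the disclosed chips.

```python
ROW_BITS = 12   # assumed: 4096 rows per bank
COL_BITS = 10   # assumed: 1024 columns per row


def generate_address(bank_base: int, row: int, col: int) -> int:
    """Illustrative AGU computation: base + (row << COL_BITS) + col."""
    row %= 1 << ROW_BITS            # modulo keeps the row index inside the bank
    col %= 1 << COL_BITS            # modulo keeps the column index inside the row
    return bank_base + (row << COL_BITS) + col


# Example: address of row 5, column 17 in a bank whose base address is 0x40000.
print(hex(generate_address(0x40000, 5, 17)))   # 0x41411
```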

In various embodiments, memory chips of selectable size may be formed from wafer 3501 by selectively dicing different regions of the wafer. As mentioned, the wafer may include a group of dies 3503, the group including any group of two or more dies (e.g., 2, 3, 4, 5, 10, or more than 10 dies) included on the wafer. As will be discussed further below, in some cases a single memory chip may be formed by dicing a portion of the wafer that includes only one die of a group of dies. In these cases, the resulting memory chip will include the memory units associated with one die. In other cases, however, a memory chip of selectable size may be formed to include more than one die. Such memory chips may be formed by dicing regions of the wafer that include two or more dies of a group of dies included on the wafer. In these cases, the dies, together with the coupling circuits that couple the dies together, provide a multi-die memory chip. Some additional circuit elements may also be wire-connected on board between the chips, such as clock elements, data buses, or any suitable logic circuits.

In some cases, at least one controller associated with a group of dies may be configured to control the group of dies to operate as a single memory chip (e.g., a multi-memory-unit memory chip). The controller may include one or more circuits that manage the flow of data to and from the memory chip. The memory controller may be part of the memory chip, or it may be part of a separate chip not directly related to the memory chip. In one embodiment, the controller may be configured to facilitate read and write requests or other commands associated with the distributed processors of the memory chip, and may be configured to control any other suitable aspect of the memory chip (e.g., refreshing the memory chip, interacting with the distributed processors, etc.). In some cases, the controller may be part of die 3503, and in other cases the controller may be disposed adjacent to die 3503. In various embodiments, the controller may also include at least one memory controller for at least one of the memory units included on the memory chip. In some cases, the protocol used to access information on the memory chip may be independent of the duplicated logic and memory units (e.g., memory banks) that may be present on the memory chip. The protocol may be configured with different IDs or address ranges for fully accessing the data on the memory chip. Examples of chips with such protocols may include chips with a Joint Electron Device Engineering Council (JEDEC) double data rate (DDR) controller, in which different memory banks may have different address ranges; serial peripheral interface (SPI) connections, in which different memory units (e.g., memory banks) have different identifications (IDs); and the like.

In various embodiments, multiple regions may be diced from the wafer, with each region including one or more dies. In some cases, each separate region may be used to build a multi-die memory chip. In other cases, each region to be diced from the wafer may include a single die to provide a single-die memory chip. In some cases, two or more of the regions may have the same shape and have the same number of dies coupled to the coupling circuits in the same manner. Alternatively, in some embodiments, a first group of regions may be used to form a first type of memory chip, and a second group of regions may be used to form a second type of memory chip. For example, as shown in FIG. 35C, wafer 3501 may include a region 3505 that may include a single die and a second region 3504 that may include a group of two dies. When region 3505 is diced from wafer 3501, a single-die memory chip will be provided. When region 3504 is diced from wafer 3501, a multi-die memory chip will be provided. The groups shown in FIG. 35C are illustrative only, and various other regions and groups of dies may be diced from wafer 3501.

In various embodiments, dies may be formed on wafer 3501 such that they are arranged along one or more rows of the wafer, as shown, for example, in FIG. 35C. The dies may share an input-output bus 3530 corresponding to the one or more rows. In one embodiment, various dicing shapes may be used to cut a group of dies from wafer 3501, and when a group of dies that may be used to form a memory chip is cut out, at least a portion of the shared input-output bus 3530 may not be included (e.g., only a portion of input-output bus 3530 may be included as part of the memory chip formed to include the group of dies).

As previously discussed, when multiple dies (e.g., dies 3506A and 3506B, as shown in FIG. 35C) are used to form memory chip 3517, one IO controller corresponding to one of the multiple dies may be enabled and configured to control data flow to all of the processor subunits of dies 3506A and 3506B. For example, FIG. 35D shows memory dies 3506A and 3506B combined to form memory chip 3517, which includes memory banks 3511A-3511H, processor subunits 3515A-3515H, IO controllers 3521A and 3521B, and fuses 3554A and 3554B. It should be noted that, before memory chip 3517 is removed from the wafer, the memory chip corresponds to region 3517 of wafer 3501. In other words, as used here and elsewhere in this disclosure, once diced from wafer 3501, regions 3504, 3505, 3517, etc. of wafer 3501 will yield memory chips 3504, 3505, 3517, etc. In addition, the fuses herein are also referred to as disabling elements. In one embodiment, fuse 3554B may be used to deactivate IO controller 3521B, and IO controller 3521A may be used to control data flow to all memory banks 3511A-3511H by communicating data to processor subunits 3515A-3515H. In one embodiment, IO controller 3521A may be connected to the various processor subunits using any suitable connections. In some embodiments, as described further below, processor subunits 3515A-3515H may be interconnected, and IO controller 3521A may be configured to control data flow to the processor subunits 3515A-3515H that form the processing logic of memory chip 3517.

In one embodiment, IO controllers such as controllers 3521A and 3521B, and the corresponding fuses 3554A and 3554B, may be formed on wafer 3501 together with memory banks 3511A-3511H and processor subunits 3515A-3515H. In various embodiments, when memory chip 3517 is formed, one of the fuses (e.g., fuse 3554B) may be activated such that dies 3506A and 3506B are configured to form memory chip 3517, which functions as a single chip and is controlled by a single input-output controller (e.g., controller 3521A). In one embodiment, activating a fuse may include applying a current to trigger the fuse. In various embodiments, when more than one die is used to form a memory chip, all IO controllers except one may be deactivated via the corresponding fuses.
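The fuse-based configuration described above can be pictured with a minimal sketch, assuming a simple "keep the first controller active" policy; the class names and policy are illustrative assumptions, not part of the disclosure.

    # Hypothetical sketch: one-time fuses deactivate all but one IO controller
    # so that a group of dies behaves as a single chip.

    class IOController:
        def __init__(self, name):
            self.name = name
            self.fuse_blown = False   # an activated fuse deactivates the controller

        @property
        def active(self):
            return not self.fuse_blown

    def configure_as_single_chip(controllers):
        """Blow the fuse of every controller except the one kept as the single master."""
        master, *others = controllers
        for ctrl in others:
            ctrl.fuse_blown = True    # models applying a current to trigger the fuse
        return master

    controllers = [IOController("3521A"), IOController("3521B")]
    master = configure_as_single_chip(controllers)
    assert master.active and not controllers[1].active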

In various embodiments, as shown in FIG. 35C, multiple dies are formed on wafer 3501 together with the same set of input-output buses and/or control buses. An example of an input-output bus 3530 is shown in FIG. 35C. In one embodiment, one of the input-output buses (e.g., input-output bus 3530) may be connected to multiple dies. FIG. 35C shows an example in which input-output bus 3530 passes near dies 3506A and 3506B. The configuration of dies 3506A and 3506B and input-output bus 3530 shown in FIG. 35C is merely illustrative, and various other configurations may be used. For example, FIG. 35E illustrates dies 3540 formed on wafer 3501 and arranged in a hexagonal pattern. A memory chip 3532 including four dies 3540 may be diced from wafer 3501. In one embodiment, memory chip 3532 may include a portion of input-output bus 3530 connected to the four dies through suitable bus lines (e.g., line 3533, as shown in FIG. 35E). In order to route information to the appropriate memory units of memory chip 3532, memory chip 3532 may include input/output controllers 3542A and 3542B placed at branch points of input-output bus 3530. Controllers 3542A and 3542B may receive command data via input-output bus 3530 and select a branch of bus 3530 for transmitting information to the appropriate memory unit. For example, if the command data includes read/write information from/to a memory unit associated with die 3546, controller 3542A may receive the command request and transmit the data to branch 3531A of bus 3530, as shown in FIG. 35D, while controller 3542B may receive the command request and transmit the data to branch 3531B. FIG. 35E indicates various cuts of different regions that may be made, where the cutting lines are represented by dashed lines.
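The branch selection performed at such a branch point can be sketched as a simple lookup from the targeted die to the branch that reaches it. The mapping below is an invented example (the second die label is hypothetical), included only to make the routing idea concrete.

    # Hypothetical sketch: a controller placed at a bus branch point inspects the
    # die targeted by a read/write command and drives the branch that reaches it.
    # The die-to-branch mapping is an illustrative assumption, not taken from the figures.

    BRANCH_FOR_DIE = {
        "3546": "3531A",   # die served through branch 3531A
        "3547": "3531B",   # die served through branch 3531B (this die label is invented)
    }

    def select_branch(command):
        """Return the bus branch that should carry this command."""
        return BRANCH_FOR_DIE[command["target_die"]]

    command = {"op": "read", "target_die": "3546", "address": 0x40}
    assert select_branch(command) == "3531A"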

In one embodiment, a group of dies and interconnect circuitry may be designed to be included in a memory chip 3506 as shown in FIG. 36A. This embodiment may include processor subunits (for in-memory processing) that may be configured to communicate with each other. For example, each die to be included in memory chip 3506 may include various memory units such as memory banks 3511A-3511D, processor subunits 3515A-3515D, and IO controllers 3521 and 3522. IO controllers 3521 and 3522 may be connected in parallel to input-output bus 3530. IO controller 3521 may have a fuse 3554, and IO controller 3522 may have a fuse 3555. In one embodiment, processor subunits 3515A-3515D may be connected by means of, for example, a bus 3613. In some cases, one of the IO controllers may be disabled using the corresponding fuse. For example, fuse 3555 may be used to disable IO controller 3522, and IO controller 3521 may then control data flow into memory banks 3511A-3511D via processor subunits 3515A-3515D, which are connected to each other via bus 3613.

The configuration of memory units shown in FIG. 36A is merely illustrative, and various other configurations may be formed by dicing different regions of wafer 3501. For example, FIG. 36B shows a configuration with three domains 3601-3603 that contain memory units and are connected to input-output bus 3530. In one embodiment, domains 3601-3603 are connected to input-output bus 3530 using IO control modules 3521-3523, which can be deactivated by corresponding fuses 3554-3556. Another example of arranging domains containing memory units is shown in FIG. 36C, where bus lines 3611, 3612, and 3613 are used to connect the three domains 3601, 3602, and 3603 to input-output bus 3530. FIG. 36D shows another embodiment in which memory chips 3506A-3506D are connected to input-output buses 3530A and 3530B via IO controllers 3521-3524. In one embodiment, the IO controllers may be deactivated using corresponding fuse elements 3554-3557, as shown in FIG. 36D.

FIG. 37 shows various groups of dies 3503, such as group 3713 and group 3715, each of which may include one or more dies 3503. In one embodiment, in addition to the dies 3503 formed on wafer 3501, wafer 3501 may also contain logic circuits 3711, referred to as glue logic 3711. Glue logic 3711 may take up some space on wafer 3501, resulting in a smaller number of dies being fabricated per wafer 3501 compared to the number of dies that could have been fabricated without glue logic 3711. However, the presence of glue logic 3711 may allow multiple dies to be configured to function together as a single memory chip. For example, the glue logic may connect multiple dies without having to change their configuration and without having to dedicate area within any of the dies themselves to circuitry whose only purpose is to connect the dies together. In various embodiments, glue logic 3711 provides an interface to other memory controllers such that the multi-die memory chip functions as a single memory chip. Glue logic 3711 may be diced together with a group of dies (e.g., as shown by group 3713). Alternatively, if a memory chip requires only one die, for example as with group 3715, the glue logic may not be diced with it. For example, the glue logic may be selectively omitted where cooperation between different dies does not need to be implemented. In FIG. 37, various cuts of different regions may be made, for example, as shown by the dashed regions. In various embodiments, as shown in FIG. 37, one glue logic element 3711 may be arranged on the wafer for every two dies 3506. In some cases, one glue logic element 3711 may be used for any suitable number of dies 3506 forming a group of dies. Glue logic 3711 may be configured to connect to all of the dies in a group of dies. In various embodiments, dies connected to glue logic 3711 may be configured to form a multi-die memory chip, and dies not connected to glue logic 3711 may be configured to form separate single-die memory chips. In various embodiments, dies that are connected to glue logic 3711 and designed to function together may be diced from wafer 3501 as a group and may include glue logic 3711, for example, as indicated by group 3713. Dies not connected to glue logic 3711 may be diced from wafer 3501 without including glue logic 3711, for example, as indicated by group 3715, to form single-die memory chips.

In some embodiments, during fabrication of multi-die memory chips from wafer 3501, one or more cutting shapes (e.g., the shapes forming groups 3713 and 3715) may be determined for producing the desired set of multi-die memory chips. In some cases, as shown by group 3715, the cutting shape may not include glue logic 3711.

In various embodiments, glue logic 3711 may be a controller for controlling the multiple memory units of a multi-die memory chip. In some cases, glue logic 3711 may include parameters that can be modified by various other controllers. For example, the coupling circuitry for a multi-die memory chip may include circuitry for configuring the parameters of glue logic 3711 or the parameters of the memory controllers (e.g., processor subunits 3515A-3515D, as shown, for example, in FIG. 35B). Glue logic 3711 may be configured to perform a variety of tasks. For example, logic 3711 may be configured to determine which die may need to be addressed. In some cases, logic 3711 may be used to synchronize multiple memory units. In various embodiments, logic 3711 may be configured to control the various memory units such that the memory units operate as a single chip. In some cases, amplifiers may be added between the input-output bus (e.g., bus 3530, as shown in FIG. 35C) and processor subunits 3515A-3515D to amplify the data signals from bus 3530.
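One simplified way to picture the addressing task of the glue logic is as splitting a chip-level address into a die index and a local address, so that the multi-die chip presents one contiguous address space. The sketch below is written for illustration only; the per-die capacity and the concatenated address map are assumptions, not details from the disclosure.

    # Hypothetical sketch: glue logic presenting several dies as one address space.
    # DIE_CAPACITY and the simple concatenated address map are illustrative assumptions.

    DIE_CAPACITY = 1 << 20   # assumed addressable words per die

    def select_die(chip_address, num_dies):
        """Return (die_index, local_address) for a chip-level address."""
        die_index, local_address = divmod(chip_address, DIE_CAPACITY)
        if die_index >= num_dies:
            raise ValueError("address outside the multi-die chip")
        return die_index, local_address

    # A chip built from four dies: an address halfway into the third die's range.
    die, local = select_die(5 * DIE_CAPACITY // 2, num_dies=4)
    assert (die, local) == (2, DIE_CAPACITY // 2)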

In various embodiments, cutting complex shapes from wafer 3501 may be technically difficult or expensive, and a simpler cutting approach may be employed, provided that the dies are aligned on wafer 3501. For example, FIG. 38A shows dies 3506 aligned to form a rectangular grid. In one embodiment, vertical cuts 3803 and horizontal cuts 3801 across the entire wafer 3501 may be made to separate the groups of dies to be cut out. In one embodiment, vertical cuts 3803 and horizontal cuts 3801 may produce groups containing a selected number of dies. For example, cuts 3803 and 3801 may produce regions containing a single die (e.g., region 3811A), regions containing two dies (e.g., region 3811B), and regions containing four dies (e.g., region 3811C). The regions formed by cuts 3801 and 3803 are merely illustrative, and any other suitable regions may be formed. In various embodiments, various cuts may be made depending on the die alignment. For example, if the dies are arranged in a triangular grid, as shown in FIG. 38B, cutting lines such as lines 3802, 3804, and 3806 may be used to produce multi-die memory chips. For example, some regions may include six dies, five dies, four dies, three dies, two dies, one die, or any other suitable number of dies.

FIG. 38C shows bus lines 3530 arranged in a triangular grid, with dies 3503 aligned at the centers of the triangles formed by the intersecting bus lines 3530. Dies 3503 may be connected to all adjacent bus lines via bus lines 3820. By cutting out a region containing two or more adjacent dies (e.g., region 3822, as shown in FIG. 38C), at least one bus line (e.g., line 3824) remains within region 3822, and bus line 3824 may be used to supply data and commands to the multi-die memory chip formed using region 3822.

FIG. 39 shows various connections that may be formed between processor subunits 3515A-3515P to allow a group of memory units to function as a single memory chip. For example, a group 3901 of various memory units may include a connection 3905 between processor subunit 3515B and subunit 3515E. Connection 3905 may serve as a bus line for transferring data and commands from subunit 3515B to subunit 3515E, which may be used to control the respective memory bank 3511E. In various embodiments, the connections between processor subunits may be implemented during formation of the dies on wafer 3501. In some cases, additional connections may be fabricated during the packaging stage of a memory chip formed from several dies.

As shown in FIG. 39, processor subunits 3515A-3515P may be connected to each other using various buses (e.g., connection 3905). Connection 3905 may be free of timing hardware logic components, such that data transfers between processor subunits and across connection 3905 may not be controlled by timing hardware logic components. In various embodiments, the buses connecting processor subunits 3515A-3515P may be laid out on wafer 3501 prior to fabrication of the various circuits on wafer 3501.

In various embodiments, the processor subunits (e.g., subunits 3515A-3515P) may be interconnected. For example, subunits 3515A-3515P may be connected by a suitable bus (e.g., connection 3905). Connection 3905 may connect any one of subunits 3515A-3515P with any other of subunits 3515A-3515P. In one embodiment, the connected subunits may be on the same die (e.g., subunits 3515A and 3515B), and in other cases, the connected subunits may be on different dies (e.g., subunits 3515B and 3515E). Connection 3905 may include a dedicated bus for connecting the subunits and may be configured to transfer data efficiently between subunits 3515A-3515P.

Various aspects of the present disclosure relate to methods for producing size-selectable memory chips from a wafer. In one embodiment, a size-selectable memory chip may be formed from one or more dies. As previously mentioned, the dies may be arranged along one or more rows, as shown, for example, in FIG. 35C. In some cases, at least one shared input-output bus corresponding to the one or more rows may be arranged on wafer 3501. For example, bus 3530 may be arranged as shown in FIG. 35C. In various embodiments, bus 3530 may be electrically connected to the memory units of at least two of the dies, and the connected dies may be used to form a multi-die memory chip. In one embodiment, one or more controllers (e.g., input-output controllers 3521 and 3522, as shown in FIG. 35B) may be configured to control the memory units of the at least two dies used to form the multi-die memory chip. In various embodiments, the dies having memory units connected to bus 3530 may be diced from the wafer together with a corresponding portion of the shared input-output bus (e.g., bus 3530, as shown in FIG. 35B), and the shared input-output bus transmits information to at least one controller (e.g., controllers 3521, 3522) to configure the controller to control the memory units of the connected dies so that they function as a single chip.

In some cases, the memory units located on wafer 3501 may be tested before memory chips are fabricated by dicing regions of wafer 3501. The testing may be performed using at least one shared input-output bus (e.g., bus 3530, as shown in FIG. 35C). When memory units pass the test, a memory chip may be formed from the group of dies containing those memory units. Memory units that fail the test may be discarded and not used for fabricating memory chips.
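The pre-dicing test flow just described amounts to keeping only those candidate die groups in which every die passed the wafer-level test. A minimal sketch of that selection is given below; the test results and group boundaries are invented for illustration.

    # Hypothetical sketch: keep only candidate die groups in which every die
    # passed the wafer-level test carried out over the shared input-output bus.
    # The pass/fail map and groupings below are illustrative assumptions.

    test_passed = {"die0": True, "die1": True, "die2": False, "die3": True}

    candidate_groups = [("die0", "die1"), ("die2", "die3")]

    def usable_groups(groups, results):
        """Return the groups whose dies all passed the test."""
        return [group for group in groups if all(results[die] for die in group)]

    assert usable_groups(candidate_groups, test_passed) == [("die0", "die1")]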

FIG. 40 shows one embodiment of a process 4000 for building a memory chip from a group of dies. At step 4011 of process 4000, dies may be laid out on semiconductor wafer 3501. At step 4015, the dies may be fabricated on wafer 3501 using any suitable method. For example, the dies may be fabricated by etching wafer 3501, depositing various dielectric, metal, or semiconductor layers, further etching the deposited layers, and so forth. For example, multiple layers may be deposited and etched. In various embodiments, the layers may be n-type doped or p-type doped using any suitable doping element. For example, a semiconductor layer may be n-type doped with phosphorus and p-type doped with boron. As shown in FIG. 35A, dies 3503 may be separated from each other by spaces that may be used to cut dies 3503 from wafer 3501. For example, dies 3503 may be separated from each other by spacer regions, where the width of the spacer regions may be selected to allow wafer dicing within the spacer regions.

At step 4017, dies 3503 may be cut from wafer 3501 using any suitable method. In one embodiment, dies 3503 may be cut out using a laser. In one embodiment, wafer 3501 may first be scribed, followed by mechanical dicing. Alternatively, a mechanical dicing saw may be used. In some cases, a stealth dicing process may be used. During dicing, wafer 3501 may be mounted on a dicing tape that holds the dies once they are cut. In various embodiments, large cuts may be made, for example, as shown by cuts 3801 and 3803 in FIG. 38A or by cuts 3802, 3804, or 3806 in FIG. 38B. Once dies 3503 are cut out individually or in groups, as shown, for example, by group 3504 in FIG. 35C, dies 3503 may be packaged. Packaging of the dies may include forming contacts to dies 3503, depositing a protective layer over the contacts, attaching a thermal management device (e.g., a heat sink), and encapsulating dies 3503. In various embodiments, depending on how many dies are selected to form a memory chip, an appropriate configuration of contacts and buses may be used. In one embodiment, some of the contacts between the different dies forming the memory chip may be made during packaging of the memory chip.

FIG. 41A shows one embodiment of a process 4100 for fabricating a memory chip containing multiple dies. Step 4011 of process 4100 may be the same as step 4011 of process 4000. At step 4111, as shown in FIG. 37, glue logic 3711 may be laid out on wafer 3501. Glue logic 3711 may be any suitable logic for controlling the operation of dies 3506, as shown in FIG. 37. As described above, the presence of glue logic 3711 may allow multiple dies to function as a single memory chip. Glue logic 3711 may provide an interface to other memory controllers such that a memory chip formed from multiple dies functions as a single memory chip.

At step 4113 of process 4100, buses (e.g., input-output buses and control buses) may be laid out on wafer 3501. The buses may be arranged such that they connect to the various dies and to logic circuits such as glue logic 3711. In some cases, the buses may connect memory units. For example, a bus may be configured to connect the processor subunits of different dies. At step 4115, the dies, glue logic, and buses may be fabricated using any suitable method. For example, logic elements may be fabricated by etching wafer 3501, depositing various dielectric, metal, or semiconductor layers, further etching the deposited layers, and so forth. The buses may be fabricated using, for example, metal evaporation.

At step 4140, a cutting shape may be used to cut out a group of dies connected to a single glue logic 3711, as shown, for example, in FIG. 37. The cutting shape may be determined using the memory requirements of the memory chip containing multiple dies 3503. For example, FIG. 41B shows a process 4101, which may be a variation of process 4100 in which steps 4117 and 4119 may be placed before step 4140 of process 4100. At step 4117, the system for dicing wafer 3501 may receive instructions describing the requirements for the memory chip. For example, the requirements may include forming a memory chip that includes four dies 3503. In some cases, at step 4119, program software may determine a periodic pattern for a group of dies and glue logic 3711. For example, the periodic pattern may include two glue logic elements 3711 and four dies 3503, where every two dies are connected to one glue logic 3711. Alternatively, at step 4119, the pattern may be provided by the designer of the memory chip.

In some cases, the pattern may be selected to maximize the yield of memory chips formed from wafer 3501. In one embodiment, the memory units of dies 3503 may be tested to identify dies with faulty memory units (such dies are referred to as failed dies), and, based on the locations of the failed dies, groups of dies 3503 containing memory units that pass the test may be identified and an appropriate cutting pattern may be determined. For example, if a large number of dies 3503 fail at the edge of wafer 3501, the cutting pattern may be determined so as to avoid the dies at the edge of wafer 3501. Other steps of process 4101, such as steps 4011, 4111, 4113, 4115, and 4140, may be the same as the identically numbered steps of process 4100.
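One way to picture this yield-driven choice is as a search over candidate placements of a fixed cutting pattern on the wafer's die grid, keeping only placements that avoid failed dies. The grid contents, pattern size, and scoring in the sketch below are illustrative assumptions, not data from the disclosure.

    # Hypothetical sketch: place a 2x2 cutting pattern on a die grid so that it
    # avoids failed dies (marked 'X'). 'G' marks a die that passed testing.

    wafer = [
        list("GGGX"),
        list("GGGG"),
        list("XGGG"),
    ]

    def placements(grid, rows=2, cols=2):
        """Yield top-left corners where a rows x cols group contains no failed die."""
        for r in range(len(grid) - rows + 1):
            for c in range(len(grid[0]) - cols + 1):
                cells = [grid[r + i][c + j] for i in range(rows) for j in range(cols)]
                if "X" not in cells:
                    yield (r, c)

    # Prints the valid top-left corners, e.g. [(0, 0), (0, 1), (1, 1), (1, 2)]
    print(list(placements(wafer)))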

FIG. 41C shows an embodiment of a process 4102 that may be a variation of process 4101. Steps 4011, 4111, 4113, 4115, and 4140 of process 4102 may be the same as the identically numbered steps of process 4101; step 4131 of process 4102 may replace step 4117 of process 4101, and step 4133 of process 4102 may replace step 4119 of process 4101. At step 4131, the system for dicing wafer 3501 may receive instructions describing the requirements for a first set of memory chips and a second set of memory chips. For example, the requirements may include forming a first set of memory chips consisting of memory chips composed of four dies 3503, and forming a second set of memory chips consisting of memory chips composed of two dies 3503. In some cases, more than two sets of memory chips may need to be formed from wafer 3501. For example, a third set of memory chips may include memory chips composed of only one die 3503. In some cases, at step 4133, program software may determine a periodic pattern of groups of dies and glue logic 3711 for forming the memory chips of each memory chip set. For example, the first set of memory chips may include memory chips containing two glue logic elements 3711 and four dies 3503, where every two dies are connected to one glue logic 3711. In various embodiments, the glue logic units 3711 for the same memory chip may be linked together to function as a single glue logic. For example, during fabrication of glue logic 3711, appropriate bus lines linking the glue logic units 3711 to one another may be formed.

The second set of memory chips may include memory chips containing one glue logic 3711 and two dies 3503, where the dies 3503 are connected to the glue logic 3711. In some cases, when a third set of memory chips is selected and when the third set of memory chips includes memory chips composed of a single die 3503, those memory chips may not require glue logic 3711.

When designing a memory chip, or a memory instance within a chip, an important characteristic is the number of words that can be accessed simultaneously during a single clock cycle. The more addresses that can be accessed simultaneously for reading and/or writing (e.g., addresses along rows, also referred to as words or word lines, and along columns, also referred to as bits or bit lines), the faster the memory chip. Although there has been some activity in developing memories that include multiple ports allowing simultaneous access to multiple addresses, for example for building register files, caches, or shared memories, most instances use memory mats that are larger in size and support multi-address access. However, DRAM chips typically include a single bit line and a single row line connected to each capacitor of each memory cell. Accordingly, embodiments of the present disclosure seek to provide multi-port access to existing DRAM chips without modifying this conventional single-port memory structure of the DRAM array.

Embodiments of the present disclosure may clock a memory instance or chip at twice the speed of the logic circuit using the memory. Any logic circuit using the memory may thus "correspond to" the memory and any of its components. Accordingly, embodiments of the present disclosure may fetch or write two addresses in two memory-array clock cycles, which are equivalent to a single processing clock cycle of the logic circuit. The logic circuit may comprise circuits such as a controller, an accelerator, a GPU, or a CPU, or may comprise a processing group on the same substrate as the memory chip, for example, as depicted in FIG. 7A. As explained above with respect to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on a substrate. The group may represent a spatial distribution on the substrate and/or a logical grouping for the purpose of compiling code for execution on memory chip 2800. Accordingly, as described above with respect to FIG. 7A, the substrate with the memory chip may include a memory array having multiple banks, such as bank 2801a and the other banks shown in FIG. 28. In addition, the substrate may also include a processing array, which may include multiple processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

Accordingly, embodiments of the present disclosure may retrieve data from the array within each of two consecutive memory cycles in order to handle two addresses for each logic cycle and provide two results to the logic, as if the single-port memory array were a dual-port memory chip. Additional clocking may allow the memory chips of the present disclosure to function as if the single-port memory array were a dual-port memory instance, a three-port memory instance, a four-port memory instance, or any other multi-port memory instance.
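The timing relationship described above can be illustrated with a small behavioral model: the memory is stepped twice for every logic-clock cycle, so two single-port accesses complete in what the logic sees as one cycle. This is a simplified simulation written for illustration only; it does not model decoders, sense amplifiers, or the buffers discussed below.

    # Hypothetical sketch: a single-port memory clocked at twice the logic rate
    # serves two addresses per logic cycle, appearing dual-ported to the logic.

    memory = {(0, 3): "DataA", (5, 7): "DataB"}   # (row, col) -> stored word (toy contents)

    def single_port_read(address):
        """One memory clock cycle: exactly one address can be accessed."""
        return memory[address]

    def logic_cycle_dual_read(addr_a, addr_b):
        """One logic clock cycle = two memory clock cycles back to back."""
        result_a = single_port_read(addr_a)   # first memory clock cycle
        result_b = single_port_read(addr_b)   # second memory clock cycle
        return result_a, result_b             # both results returned to the logic together

    assert logic_cycle_dual_read((0, 3), (5, 7)) == ("DataA", "DataB")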

FIG. 42 depicts an embodiment, consistent with the present disclosure, of circuitry 4200 that provides dual-port access along the columns of the memory chip in which circuitry 4200 is used. The embodiment depicted in FIG. 42 may use one memory array 4201 with two column multiplexers ("mux") 4205a and 4205b to access two words on the same row during the same clock cycle of the logic circuit. For example, during one memory clock cycle, RowAddrA is used in row decoder 4203 and ColAddrA is used in multiplexer 4205a to buffer data from the memory cell having address (RowAddrA, ColAddrA). During the same memory clock cycle, ColAddrB is used in multiplexer 4205b to buffer data from the memory cell having address (RowAddrA, ColAddrB). Accordingly, circuitry 4200 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses along the same row or word line. Thus, the two addresses may share a row such that row decoder 4203 activates the same word line for both retrievals. Furthermore, the embodiment depicted in FIG. 42 may use column multiplexers such that the two addresses may be accessed during the same memory clock cycle.
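A behavioral sketch of this same-row, two-column access is given below: one word line is activated, and two column selections read different words from it within the same memory cycle. The array dimensions and contents are invented for illustration; the sketch is a simplified model, not a circuit description.

    # Hypothetical sketch of the FIG. 42 style access: one activated word line,
    # two column multiplexers selecting two different columns in the same cycle.

    ROWS, COLS = 4, 8
    array = [[f"r{r}c{c}" for c in range(COLS)] for r in range(ROWS)]

    def dual_column_read(row_addr, col_addr_a, col_addr_b):
        """Single memory clock cycle: the row decoder activates one word line,
        and each column mux buffers one word from that line."""
        word_line = array[row_addr]           # row decoder activates RowAddr
        data_a = word_line[col_addr_a]        # column mux 4205a selects ColAddrA
        data_b = word_line[col_addr_b]        # column mux 4205b selects ColAddrB
        return data_a, data_b

    assert dual_column_read(1, 2, 6) == ("r1c2", "r1c6")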

Similarly, FIG. 43 depicts an embodiment, consistent with the present disclosure, of circuitry 4300 that provides dual-port access along the rows of the memory chip in which circuitry 4300 is used. The embodiment depicted in FIG. 43 may use one memory array 4301 with a row decoder 4303 coupled to a multiplexer ("mux") to access two words on the same column during the same clock cycle of the logic circuit. For example, on the first of two memory clock cycles, RowAddrA is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell having address (RowAddrA, ColAddrA) (e.g., into the "buffered word" buffer of FIG. 43). On the second of the two memory clock cycles, RowAddrB is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell having address (RowAddrB, ColAddrA). Accordingly, circuitry 4300 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses along the same column or bit line. Thus, the two addresses may share a column such that the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) activates the same bit line for both retrievals. The embodiment depicted in FIG. 43 may use two memory clock cycles because row decoder 4303 may require one memory clock cycle to activate each word line. Accordingly, a memory chip using circuitry 4300 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuit.

Accordingly, as explained above, FIG. 43 may retrieve DataA and DataB during two memory clock cycles that are faster than the clock cycle used for the corresponding logic circuit. For example, the row decoder (e.g., row decoder 4303 of FIG. 43) and the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) may be configured to be clocked at a rate at least twice the rate at which the corresponding logic circuit generates the two addresses. For example, a clock circuit for circuitry 4300 (not shown in FIG. 43) may clock circuitry 4300 at a rate at least twice the rate at which the corresponding logic circuit generates the two addresses.

The embodiment of FIG. 42 and the embodiment of FIG. 43 may be used separately or in combination. Thus, circuitry that provides dual-port functionality on a single-port memory array or mat (e.g., circuitry 4200 or 4300) may include multiple memory banks arranged along at least one row and at least one column. The multiple memory banks are depicted as memory array 4201 in FIG. 42 and as memory array 4301 in FIG. 43. Such embodiments may further use at least one row multiplexer (as depicted in FIG. 43) or at least one column multiplexer (as depicted in FIG. 42) configured to receive two addresses for reading or writing during a single clock cycle. Furthermore, such embodiments may use a row decoder (e.g., row decoder 4203 of FIG. 42 or row decoder 4303 of FIG. 43) and a column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIGS. 42 and 43) to read from or write to the two addresses. For example, the row decoder and column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer during a first cycle and decode the word line and bit line corresponding to the first address. Furthermore, the row decoder and column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer during a second cycle and decode the word line and bit line corresponding to the second address. Each retrieval may include activating the word line corresponding to the address using the row decoder and activating the bit line corresponding to the address on the activated word line using the column decoder.

Although described above with respect to retrievals, the embodiments of FIG. 42 and FIG. 43, whether implemented separately or in combination, may include write commands. For example, during the first cycle, the row decoder and column decoder may write first data, retrieved from the at least one row multiplexer or the at least one column multiplexer, to the first of the two addresses. During the second cycle, the row decoder and column decoder may write second data, retrieved from the at least one row multiplexer or the at least one column multiplexer, to the second of the two addresses.

FIG. 42 depicts an embodiment of this process when the first and second addresses share a word line address, while FIG. 43 depicts an embodiment of this process when the first and second addresses share a bit line (column) address. As described further below with respect to FIG. 47, the same process may be implemented when the first and second addresses share neither a word line address nor a bit line address.

Accordingly, although the embodiments above provide dual-port access along at least one of the rows or the columns, additional embodiments may provide dual-port access along both the rows and the columns. FIG. 44 depicts an embodiment, consistent with the present disclosure, of circuitry 4400 that provides dual-port access along both the rows and the columns of the memory chip in which circuitry 4400 is used. Thus, circuitry 4400 may represent a combination of circuitry 4200 of FIG. 42 with circuitry 4300 of FIG. 43.

The embodiment depicted in FIG. 44 may use one memory array 4401 with a row decoder 4403 coupled to a multiplexer ("mux") to access two rows during the same clock cycle of the logic circuit. Furthermore, the embodiment depicted in FIG. 44 may use a column decoder (or multiplexer) 4405 coupled to a multiplexer ("mux") to access two columns of memory array 4401 during the same clock cycle. For example, on the first of two memory clock cycles, RowAddrA is used in row decoder 4403 and ColAddrA is used in column multiplexer 4405 to buffer data from the memory cell having address (RowAddrA, ColAddrA) (e.g., into the "buffered word" buffer of FIG. 44). On the second of the two memory clock cycles, RowAddrB is used in row decoder 4403 and ColAddrB is used in column multiplexer 4405 to buffer data from the memory cell having address (RowAddrB, ColAddrB). Accordingly, circuitry 4400 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses. The embodiment depicted in FIG. 44 may use an additional buffer because row decoder 4403 may require one memory clock cycle to activate each word line. Accordingly, a memory chip using circuitry 4400 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuit.

Although not depicted in FIG. 44, circuitry 4400 may also include the additional circuitry of FIG. 46 (described further below) along the rows or word lines and/or similar additional circuitry along the columns or bit lines. Accordingly, circuitry 4400 may activate the corresponding circuitry (e.g., by opening one or more switching elements, such as one or more of switching elements 4613a and 4613b of FIG. 46, and the like) to activate the disconnected portion that includes the address (e.g., by connecting a voltage or allowing current to flow to the disconnected portion). Thus, circuitry may "correspond" when an element of the circuitry (such as a line or the like) includes the location identified by the address and/or when an element of the circuitry (such as a switching element) controls the supply of voltage and/or current to the memory cell identified by the address. Circuitry 4400 may then use row decoder 4403 and column multiplexer 4405 to decode the corresponding word lines and bit lines in order to retrieve data from, or write data to, the address located in the activated disconnected portion.

As further depicted in FIG. 44, circuitry 4400 may further use at least one row multiplexer (depicted as separate from row decoder 4403, but which may be incorporated therein) and/or at least one column multiplexer (e.g., depicted as separate from column multiplexer 4405, but which may be incorporated therein) configured to receive two addresses for reading or writing during a single clock cycle. Accordingly, embodiments may use a row decoder (e.g., row decoder 4403) and a column decoder (which may be separate from or combined with column multiplexer 4405) to read from or write to the two addresses. For example, the row decoder and column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer during a memory clock cycle and decode the word line and bit line corresponding to the first address. Furthermore, the row decoder and column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer during the same memory cycle and decode the word line and bit line corresponding to the second address.

FIGS. 45A and 45B depict existing duplication techniques for providing dual-port functionality on a single-port memory array or mat. As shown in FIG. 45A, dual-port reads may be provided by keeping copies of the data synchronized across memory arrays or mats. Reads may therefore be performed from both copies of the memory instance, as depicted in FIG. 45A. Furthermore, as shown in FIG. 45B, dual-port writes may be provided by duplicating all writes across the memory arrays or mats. For example, the memory chip may require the logic circuit using the memory chip to send write commands repeatedly, one write command per copy of the data. Alternatively, in some embodiments, as shown in FIG. 45A, additional circuitry may allow the logic circuit using the memory instance to send a single write command that is automatically duplicated by the additional circuitry to create copies of the written data across the memory arrays or mats, so that the copies remain synchronized. The embodiments of FIGS. 42, 43, and 44 may reduce the redundancy of these existing duplication techniques by using multiplexers to access two bit lines in a single memory clock cycle (e.g., as depicted in FIG. 42) and/or by clocking the memory faster than the corresponding logic circuit (e.g., as depicted in FIGS. 43 and 44) and providing additional multiplexers to handle the additional addresses, rather than duplicating all of the data in the memory.

In addition to the faster clocking and/or additional multiplexers described above, embodiments of the present disclosure may also use circuitry that disconnects the bit lines and/or word lines at certain points within the memory array. These embodiments may allow multiple simultaneous accesses to the array, as long as the row decoder and column decoder access different locations that are not coupled to the same portion of the disconnection circuitry. For example, locations with different word lines and bit lines may be accessed simultaneously, because the disconnection circuitry may allow the row and column decoding to access the different addresses without electrical interference. During the design of the memory chip, the granularity of the disconnected regions within the memory array may be traded off against the additional area required by the disconnection circuitry.

An architecture for implementing this simultaneous access is depicted in FIG. 46. Specifically, FIG. 46 depicts an embodiment of circuitry 4600 that provides dual-port functionality on a single-port memory array or mat. As depicted in FIG. 46, circuitry 4600 may include multiple memory mats (e.g., memory mat 4609a, mat 4609b, and the like) arranged along at least one row and at least one column. The layout of circuitry 4600 also includes multiple word lines, such as word lines 4611a and 4611b corresponding to rows, and bit lines 4615a and 4615b corresponding to columns.

The embodiment depicted in FIG. 46 includes twelve memory mats, each having two lines and eight columns. In other embodiments, the substrate may include any number of memory mats, and each memory mat may include any number of lines and any number of columns. Some memory mats may include the same number of lines and columns (as shown in FIG. 46), while other memory mats may include different numbers of lines and/or columns.

Although not depicted in FIG. 46, circuitry 4600 may further use at least one row multiplexer (separate from or combined with row decoders 4601a and/or 4601b) or at least one column multiplexer (e.g., column multiplexers 4603a and/or 4603b) configured to receive two (or three, or any larger number of) addresses for reading or writing during a single clock cycle. Furthermore, embodiments may use row decoders (e.g., row decoders 4601a and/or 4601b) and column decoders (which may be separate from or combined with column multiplexers 4603a and/or 4603b) to read from or write to the two (or more) addresses. For example, the row decoder and column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer during a memory clock cycle and decode the word line and bit line corresponding to the first address. Furthermore, the row decoder and column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer during the same memory cycle and decode the word line and bit line corresponding to the second address. As explained above, the two addresses may be accessed during the same memory clock cycle as long as they are in different locations that are not coupled to the same portion of the disconnection circuitry (e.g., switching elements such as 4613a, 4613b, and the like). In addition, circuitry 4600 may simultaneously access the first two addresses during a first memory clock cycle and then simultaneously access the next two addresses during a second memory clock cycle. In these embodiments, a memory chip using circuitry 4600 may function as a four-port memory if clocked at least twice as fast as the corresponding logic circuit.

FIG. 46 also includes at least one row circuit and at least one column circuit configured to function as switches. For example, corresponding switching elements such as 4613a, 4613b, and the like may comprise transistors or any other electrical elements configured to allow or stop the flow of current and/or to connect or disconnect a voltage from the word line or bit line connected to the switching element. Thus, the corresponding switching elements may divide circuitry 4600 into disconnected portions. Although the disconnected regions are depicted as each comprising a single row of sixteen columns, the disconnected regions within circuitry 4600 may have different levels of granularity depending on the design of circuitry 4600.

Circuitry 4600 may use a controller (e.g., row control 4607) to activate corresponding ones of the at least one row circuit and the at least one column circuit in order to activate the corresponding disconnected region during the address operations described above. For example, circuitry 4600 may transmit one or more control signals to close corresponding ones of the switching elements (e.g., switching elements 4613a, 4613b, and the like). In embodiments where switching elements 4613a, 4613b, and the like comprise transistors, the control signals may comprise voltages that open the transistors.

取决于包括地址的断开区,可由电路系统4600启动开关元件中的多于一者。例如,为到达图46的存储器垫4609b内的地址,必须断开允许存取存储器垫4609a的开关元件以及允许存取存储器垫4609b的开关元件。行控制件4607可判定要启动的开关元件,以便根据特定地址取回电路系统4600内的特定地址。Depending on the open region including the address, more than one of the switching elements may be activated by the circuitry 4600 . For example, to reach an address within memory pad 4609b of Figure 46, the switching elements that allow access to memory pad 4609a and the switching elements that allow access to memory pad 4609b must be turned off. Row controls 4607 can determine which switching elements to activate in order to retrieve specific addresses within circuitry 4600 based on specific addresses.

图46表示用以划分存储器阵列(例如,包含存储器垫4609a、垫4609b及其类似物)的字线的电路系统4600的实施例。然而,其他实施例可使用类似电路系统(例如,将存储器芯片4600分成断开区的开关元件)以划分存储器阵列的比特线。因此,电路系统4600的架构可用于双列存取(如图42或图44中所描绘的情况)以及双行存取(如图43或图44中所描绘的情况)中。46 shows an embodiment of circuitry 4600 for dividing word lines of a memory array (eg, including memory pad 4609a, pad 4609b, and the like). However, other embodiments may use similar circuitry (eg, switching elements that divide the memory chip 4600 into disconnected regions) to divide the bit lines of the memory array. Thus, the architecture of the circuitry 4600 can be used in dual column access (as depicted in FIG. 42 or FIG. 44) as well as dual row access (as depicted in FIG. 43 or FIG. 44).

用于对存储器阵列或垫进行多循环存取的处理程序描绘于图47A中。具体而言,图47A为用于在单端口存储器阵列或垫上提供双端口存取(例如,使用图43的电路系统4300或图44的电路系统4400)的处理程序4700的实施例的流程图,可使用符合本公开的行解码器及列解码器执行处理程序4700,诸如分别图43或图44的行解码器4303或4403,及列解码器(其可与一个或多个列多任务器分开或组合,诸如分别描绘于图43或图44中的列多任务器4305或4405)。The processing procedure for multi-cycle access to a memory array or pad is depicted in Figure 47A. Specifically, FIG. 47A is a flowchart of an embodiment of a process 4700 for providing dual-port access on a single-port memory array or pad (eg, using the circuitry 4300 of FIG. 43 or the circuitry 4400 of FIG. 44 ), Process 4700 may be performed using row and column decoders consistent with the present disclosure, such as row decoders 4303 or 4403 of FIG. 43 or FIG. 44, respectively, and column decoders (which may be separate from one or more column multiplexers) or a combination, such as column multiplexers 4305 or 4405 depicted in Figure 43 or Figure 44, respectively).

At step 4710, during a first memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit lines corresponding to a first of two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line that corresponds to the first address. The amplified voltage may be provided to a logic circuit using the memory chip that includes the circuitry, or it may be buffered according to step 4720 described below. The logic circuit may comprise circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, method 4700 may similarly handle write operations. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell along the activated word line that corresponds to the first address in order to write new data to that memory cell. In some embodiments, the circuitry may provide an acknowledgement of the write to the logic circuit using the memory chip that includes the circuitry, or may buffer the acknowledgement according to step 4720 below.

At step 4720, the circuitry may buffer the data retrieved for the first address. For example, as depicted in FIG. 43 and FIG. 44, a buffer may allow the circuitry to retrieve the second of the two addresses (as described below in step 4730) and return the results of both retrievals together. The buffer may comprise a register, SRAM, non-volatile memory, or any other data storage device.

At step 4730, during a second memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit lines corresponding to the second of the two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line that corresponds to the second address. The amplified voltage may be provided to the logic circuit using the memory chip that includes the circuitry, either individually or together with, for example, the buffered voltage from step 4720. The logic circuit may comprise circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, method 4700 may similarly handle write operations. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell along the activated word line that corresponds to the second address in order to write new data to that memory cell. In some embodiments, the circuitry may provide an acknowledgement of the write to the logic circuit using the memory chip that includes the circuitry, either individually or together with, for example, the buffered voltage from step 4720.

At step 4740, the circuitry may output the retrieved data for the second address together with the buffered data for the first address. For example, as depicted in FIG. 43 and FIG. 44, the circuitry may return the results of both retrievals (e.g., from steps 4710 and 4730) together. The circuitry may return the results to the logic circuit using the memory chip that includes the circuitry, which logic circuit may comprise circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described with reference to multiple cycles, method 4700 may allow single-cycle access to the two addresses if they share a word line, as depicted in FIG. 42. For example, steps 4710 and 4730 may be performed during the same memory clock cycle, because multiple column multiplexers may decode different bit lines on the same word line during the same memory clock cycle. In these embodiments, buffering step 4720 may be skipped.
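
Purely as an illustration of the flow just described, the following Python sketch models process 4700 behaviorally: a buffer holds the result of the first retrieval while the second address is served, and the buffer step is skipped when both addresses share a word line. The class names, array geometry, and cycle counter are assumptions introduced for the sketch and are not part of the disclosed circuitry.

```python
# Hypothetical behavioral model of process 4700: emulating dual-port reads
# on a single-port memory array by spending one or two memory clock cycles.

class SinglePortArray:
    """A single-port array: one word line may be activated per memory cycle."""
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def read_row(self, row):
        # Activating a word line exposes every bit line of that row.
        return list(self.cells[row])


class DualPortEmulator:
    """Serves two (row, col) read requests, buffering the first result
    when the requests need two separate memory cycles (steps 4710-4740)."""
    def __init__(self, array):
        self.array = array
        self.memory_cycles = 0

    def read_pair(self, addr_a, addr_b):
        row_a, col_a = addr_a
        row_b, col_b = addr_b
        if row_a == row_b:
            # Shared word line: two column multiplexers can decode two
            # bit lines in the same memory cycle, so step 4720 is skipped.
            row = self.array.read_row(row_a)
            self.memory_cycles += 1
            return row[col_a], row[col_b]
        # Step 4710: first memory cycle retrieves the first address.
        value_a = self.array.read_row(row_a)[col_a]
        self.memory_cycles += 1
        # Step 4720: buffer the first result.
        buffered = value_a
        # Step 4730: second memory cycle retrieves the second address.
        value_b = self.array.read_row(row_b)[col_b]
        self.memory_cycles += 1
        # Step 4740: return both results together.
        return buffered, value_b


if __name__ == "__main__":
    array = SinglePortArray(rows=4, cols=8)
    array.cells[1][2] = 7
    array.cells[3][5] = 9
    emulator = DualPortEmulator(array)
    print(emulator.read_pair((1, 2), (3, 5)))   # (7, 9), two memory cycles
    print(emulator.read_pair((1, 2), (1, 6)))   # shared word line, one cycle
    print("memory cycles used:", emulator.memory_cycles)
```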

A process for simultaneous access (e.g., using circuitry 4600 described above) is depicted in FIG. 47B. Accordingly, although shown sequentially, the steps of FIG. 47B may all be performed during the same memory clock cycle, and at least some of the steps may be executed simultaneously (e.g., steps 4760 and 4780, or steps 4770 and 4790). Specifically, FIG. 47B is a flowchart of an embodiment of a process 4750 for providing dual-port access on a single-port memory array or pad (e.g., using circuitry 4200 of FIG. 42 or circuitry 4600 of FIG. 46). Process 4750 may be performed using row decoders and column decoders consistent with the present disclosure, such as row decoder 4203 of FIG. 42 or row decoders 4601a and 4601b of FIG. 46, respectively, and a column decoder (which may be separate from or combined with one or more column multiplexers, such as column multiplexers 4205a and 4205b or column multiplexers 4603a and 4603b depicted in FIG. 42 or FIG. 46, respectively).

At step 4760, during a memory clock cycle, the circuitry may activate corresponding ones of at least one row circuit and at least one column circuit based on a first of two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Accordingly, the circuitry may access the corresponding disconnected region that includes the first of the two addresses.

At step 4770, during the memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit lines corresponding to the first address. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line that corresponds to the first address. The amplified voltage may be provided to a logic circuit using the memory chip that includes the circuitry. For example, as described above, the logic circuit may comprise circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, process 4750 may similarly handle write operations. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell along the activated word line that corresponds to the first address in order to write new data to that memory cell. In some embodiments, the circuitry may provide an acknowledgement of the write to the logic circuit using the memory chip that includes the circuitry.

At step 4780, during the same cycle, the circuitry may activate corresponding ones of the at least one row circuit and the at least one column circuit based on the second of the two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Accordingly, the circuitry may access the corresponding disconnected region that includes the second of the two addresses.

At step 4790, during the same cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit lines corresponding to the second address. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line that corresponds to the second address. The amplified voltage may be provided to a logic circuit using the memory chip that includes the circuitry. For example, as described above, the logic circuit may comprise conventional circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, process 4750 may similarly handle write operations. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell along the activated word line that corresponds to the second address in order to write new data to that memory cell. In some embodiments, the circuitry may provide an acknowledgement of the write to the logic circuit using the memory chip that includes the circuitry.

Although described with reference to a single cycle, process 4750 may allow multi-cycle access to the two addresses if the two addresses are in disconnected regions that share a word line or bit line (or that otherwise share switching elements of the at least one row circuit and the at least one column circuit). For example, steps 4760 and 4770 may be performed during a first memory clock cycle, in which a first row decoder and a first column multiplexer decode the word line and bit lines corresponding to the first address, while steps 4780 and 4790 may be performed during a second memory clock cycle, in which a second row decoder and a second column multiplexer decode the word line and bit lines corresponding to the second address.
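
The sketch below (Python, illustrative only) summarizes process 4750 at a behavioral level: a row is split into disconnected regions by switching elements, two addresses are served in one memory cycle when their regions do not share switching elements, and the accesses fall back to two cycles otherwise. The region-per-column-range granularity and all names are assumptions made for the sketch.

```python
# Hypothetical sketch of process 4750: a word line split into disconnected
# regions; two addresses are served in one memory cycle only when their
# regions do not share switching elements.

class SegmentedBank:
    def __init__(self, rows, cols, region_width):
        self.cells = [[0] * cols for _ in range(rows)]
        self.region_width = region_width
        self.memory_cycles = 0

    def _region(self, col):
        # Index of the disconnected region (set of switching elements)
        # that must be closed to reach this column.
        return col // self.region_width

    def read_pair(self, addr_a, addr_b):
        (row_a, col_a), (row_b, col_b) = addr_a, addr_b
        independent = self._region(col_a) != self._region(col_b)
        if independent:
            # Steps 4760-4790 in a single memory clock cycle: each access
            # closes its own switching elements and uses its own decoders.
            self.memory_cycles += 1
        else:
            # The addresses share switching elements, so the accesses are
            # spread over two memory clock cycles instead.
            self.memory_cycles += 2
        return self.cells[row_a][col_a], self.cells[row_b][col_b]


if __name__ == "__main__":
    bank = SegmentedBank(rows=2, cols=16, region_width=8)
    bank.cells[0][1] = 3
    bank.cells[1][12] = 4
    print(bank.read_pair((0, 1), (1, 12)))  # different regions: one cycle
    print(bank.read_pair((0, 1), (0, 5)))   # same region: two cycles
    print("memory cycles used:", bank.memory_cycles)
```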

Another embodiment of an architecture for dual-port access along both rows and columns is depicted in FIG. 48. Specifically, FIG. 48 depicts an embodiment of circuitry 4800 that uses multiple row decoders in combination with multiple column multiplexers to provide dual-port access along both rows and columns. In FIG. 48, row decoder 4801a may access a first word line, and column multiplexer 4803a may decode data from one or more memory cells along the first word line, while row decoder 4801b may access a second word line, and column multiplexer 4803b may decode data from one or more memory cells along the second word line.

As described with respect to FIG. 47B, this access may occur simultaneously during one memory clock cycle. Accordingly, similar to the architecture of FIG. 46, the architecture of FIG. 48 (including the memory pad described below in FIG. 49) may allow multiple addresses to be accessed in the same clock cycle. For example, the architecture of FIG. 48 may include any number of row decoders and any number of column multiplexers, such that a number of addresses corresponding to the number of row decoders and column multiplexers may all be accessed within a single memory clock cycle.

In other embodiments, this access may occur sequentially over two memory clock cycles. By clocking memory chip 4800 faster than the corresponding logic circuits, the two memory clock cycles may be equivalent to one clock cycle of the logic circuit using the memory. For example, as described above, the logic circuit may comprise conventional circuits such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Other embodiments may allow simultaneous access. For example, as described with respect to FIG. 42, multiple column decoders (which may comprise column multiplexers, such as 4803a and 4803b as shown in FIG. 48) may read multiple bit lines along the same word line during a single memory clock cycle. Additionally or alternatively, as described with respect to FIG. 46, circuitry 4800 may incorporate additional circuitry so that this access may be simultaneous. For example, row decoder 4801a may access the first word line, and column multiplexer 4803a may decode data from memory cells along the first word line, during the same memory clock cycle in which row decoder 4801b accesses the second word line and column multiplexer 4803b decodes data from memory cells along the second word line.

The architecture of FIG. 48 may be used with modified memory pads forming a memory bank, as shown in FIG. 49. In FIG. 49, each memory cell (depicted as a capacitor, similar to DRAM, but which may also comprise a number of transistors arranged in a manner similar to SRAM or any other memory cell) is accessed through two word lines and two bit lines. Accordingly, memory pad 4900 of FIG. 49 allows two different logic circuits to simultaneously access two different bits, or even the same bit. However, the embodiment of FIG. 49 uses modifications to the memory pad rather than implementing a dual-port solution on standard DRAM memory pads that are wired for single-port access, as in the embodiments above.
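
As a rough behavioral picture of the modified pad of FIG. 49, the sketch below (Python, illustrative) models a cell array reachable through two independent ports in the same cycle, including access to the same bit by both ports. The arbitration rule applied when both ports write the same bit in one cycle is an assumption made here and is not taken from the figure.

```python
# Illustrative model of a memory pad in which every cell is reachable
# through two word lines and two bit lines (two independent ports).

class TwoPortPad:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def cycle(self, port_a=None, port_b=None):
        """Perform one access per port in a single cycle.
        Each port request is (op, row, col[, value]) with op in {'R', 'W'}."""
        results = []
        for request in (port_a, port_b):  # port B wins write ties (assumption)
            if request is None:
                results.append(None)
                continue
            op, row, col = request[0], request[1], request[2]
            if op == 'R':
                results.append(self.cells[row][col])
            else:
                self.cells[row][col] = request[3]
                results.append(None)
        return tuple(results)


if __name__ == "__main__":
    pad = TwoPortPad(rows=4, cols=4)
    pad.cycle(port_a=('W', 2, 3, 1))                 # port A writes a bit
    print(pad.cycle(port_a=('R', 2, 3),              # both ports read the
                    port_b=('R', 2, 3)))             # same bit in one cycle
```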

Although described as having two ports, any of the embodiments described above may be extended to more than two ports. For example, the embodiments of FIG. 42, FIG. 46, FIG. 48, and FIG. 49 may include additional column multiplexers or row multiplexers, respectively, to provide access to additional columns or rows during a single clock cycle. As another example, the embodiments of FIG. 43 and FIG. 44 may include additional row decoders and/or column multiplexers to provide access to additional rows or columns, respectively, during a single clock cycle.

Variable Word Length Access in Memory

As used above and further below, the term "coupled" may include a direct connection, an indirect connection, electrical communication, and the like.

Furthermore, terms such as "first," "second," and the like are used to distinguish between elements or method steps having the same or similar names or titles, and do not necessarily indicate a spatial or temporal order.

Typically, a memory chip may include memory banks. The memory banks may be coupled to row decoders and column decoders configured to select a particular word (or other fixed-size unit of data) to be read or written. Each memory bank may include memory cells to store the data units, amplifiers to amplify the voltages from the memory cells selected by the row and column decoders, and any other appropriate circuits.

Each memory bank typically has a particular I/O width. For example, the I/O width may comprise a word.

While some processes executed by logic circuits using a memory chip may benefit from using very long words, some other processes may require only a portion of the word.

Indeed, in-memory computing units (such as processor subunits disposed on the same substrate as the memory chip, e.g., as depicted and described in FIG. 7A) frequently perform memory access operations that require only a portion of the word.

To reduce the latency associated with accessing an entire word when only a portion of it is used, embodiments of the present disclosure may provide methods and systems for retrieving only one or more portions of a word, thereby reducing the data losses associated with transferring unneeded portions of the word and allowing power savings in the memory device.

Furthermore, embodiments of the present disclosure may also reduce the power consumption of interactions between the memory chip and other entities that access the memory chip (such as logic circuits, whether separate, like CPUs and GPUs, or included on the same substrate as the memory chip, such as the processor subunits depicted and described in FIG. 7A), which entities may receive or write only a portion of the word.

A memory access command (e.g., from a logic circuit using the memory) may include an address in the memory. For example, the address may include a row address and a column address, or it may be translated into a row address and a column address, e.g., by a memory controller of the memory.

In many volatile memories such as DRAM, the row address is sent (e.g., directly by the logic circuit or by a memory controller) to a row decoder, which activates the entire row (also referred to as a word line) and loads all the bit lines included in that row.

The column address identifies the bit lines on the activated row that are transferred outside the memory bank that includes those bit lines and passed to the next-level circuitry. For example, the next-level circuitry may comprise the I/O bus of the memory chip. In embodiments using in-memory processing, the next-level circuitry may comprise a processor subunit of the memory chip (e.g., as depicted in FIG. 7A).
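
As a purely illustrative picture of this address split, the short Python sketch below decomposes a flat cell address into a row address for the row decoder and a column address for the column circuitry; the bank geometry chosen here is an arbitrary assumption.

```python
# Illustrative split of a flat address into row and column parts for a bank
# with an assumed geometry of 1024 rows by 256 columns.

ROWS, COLS = 1024, 256

def split_address(flat_address):
    row_address = flat_address // COLS      # selects the word line to activate
    column_address = flat_address % COLS    # selects bit lines on that row
    return row_address, column_address

if __name__ == "__main__":
    print(split_address(5000))   # (19, 136)
```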

Accordingly, the memory chips described below may be included in, or may otherwise comprise, the memory chips described in any of FIG. 3A, FIG. 3B, FIGS. 4-6, FIGS. 7A-7D, FIGS. 11-13, FIGS. 16-19, FIG. 22, or FIG. 23.

The memory chip may be manufactured by a first manufacturing process optimized for memory cells rather than logic cells. For example, the memory cells manufactured by the first manufacturing process may exhibit critical dimensions that are smaller (e.g., by a factor of more than 2, 3, 4, 5, 6, 7, 8, 9, 10, or the like) than the critical dimensions of logic circuits manufactured by the first manufacturing process. For example, the first manufacturing process may comprise an analog manufacturing process, a DRAM manufacturing process, and the like.

Such a memory chip may comprise an integrated circuit, which may include a memory unit. The memory unit may include memory cells, an output port, and read circuitry. In some embodiments, the memory unit may further include a processing unit, such as a processor subunit as described above.

For example, the read circuitry may include a reduction unit and a first group of in-memory read paths for outputting up to a first number of bits via the output port. The output port may be connected to an off-chip logic circuit (such as an accelerator, CPU, GPU, or the like) or to an on-chip processor subunit, as described above.

In some embodiments, the processing unit may include the reduction unit, may be part of the reduction unit, may be distinct from the reduction unit, or may otherwise comprise the reduction unit.

The in-memory read paths may be included in the integrated circuit (e.g., within the memory unit) and may include any circuits and/or links configured to read from and/or write to the memory cells. For example, the in-memory read paths may include sense amplifiers, conductors coupled to the memory cells, multiplexers, and the like.

The processing unit may be configured to send a read request to the memory unit to read a second number of bits from the memory unit. Additionally or alternatively, the read request may originate from an off-chip logic circuit (such as an accelerator, CPU, GPU, or the like).

The reduction unit may be configured to assist in reducing the power consumption associated with the access request, for example by using any of the partial-word accesses described herein.

The reduction unit may be configured to control the in-memory read paths based on the first number of bits and the second number of bits during a read operation triggered by the read request. For example, control signals from the reduction unit may affect the power consumption of the read paths in order to reduce the energy consumption of memory read paths that are irrelevant to the requested second number of bits. For example, the reduction unit may be configured to control the irrelevant memory read paths when the second number is smaller than the first number.

As explained above, the integrated circuit may be included in, may comprise, or may otherwise be included in the memory chips described in any of FIG. 3A, FIG. 3B, FIGS. 4-6, FIGS. 7A-7D, FIGS. 11-13, FIGS. 16-19, FIG. 22, or FIG. 23.

The irrelevant in-memory read paths may be associated with irrelevant bits of the first number of bits, such as bits of the first number of bits that are not included in the second number of bits.

FIG. 50 illustrates an integrated circuit 5000 in an embodiment of the present disclosure, which includes: memory cells 5001-5008 of a memory cell array 5050; an output port 5020, which includes bits 5021-5028; read circuitry 5040, which includes memory read paths 5011-5018; and a reduction unit 5030.

When the second number of bits is read using the corresponding memory read paths, the irrelevant bits of the first number of bits may correspond to bits that should not be read (e.g., bits not included in the second number of bits).

During a read operation, reduction unit 5030 may be configured to activate the memory read paths corresponding to the second number of bits, such that the activated memory read paths are configured to convey the second number of bits. In these embodiments, only the memory read paths corresponding to the second number of bits may be activated.

During a read operation, reduction unit 5030 may be configured to shut down at least a portion of each irrelevant memory read path. For example, the irrelevant memory read paths may correspond to the irrelevant bits of the first number of bits.

It should be noted that, instead of shutting down at least a portion of the irrelevant memory paths, reduction unit 5030 may instead ensure that the irrelevant memory paths are not activated.

Additionally or alternatively, during a read operation, reduction unit 5030 may be configured to maintain the irrelevant memory read paths in a low-power mode. For example, the low-power mode may comprise supplying the irrelevant memory paths with voltages or currents that are lower than the normal operating voltages or currents, respectively.

Reduction unit 5030 may be further configured to control the bit lines of the irrelevant memory read paths.

Accordingly, reduction unit 5030 may be configured to load the bit lines of the relevant memory read paths and to maintain the bit lines of the irrelevant memory read paths in a low-power mode. For example, only the bit lines of the relevant memory read paths may be loaded.

Additionally or alternatively, reduction unit 5030 may be configured to load the bit lines of the relevant memory read paths while keeping the bit lines of the irrelevant memory read paths deactivated.

In some embodiments, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and to maintain a portion of each irrelevant memory read path in a low-power mode, wherein that portion is different from the bit line.

As explained above, a memory chip may use sense amplifiers to amplify the voltages from the memory cells included in the memory chip. Accordingly, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and to maintain the sense amplifiers associated with at least some of the irrelevant memory read paths in a low-power mode.

In these embodiments, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and to maintain one or more sense amplifiers associated with all of the irrelevant memory read paths in a low-power mode.

Additionally or alternatively, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during a read operation and to maintain, in a low-power mode, the portions of the irrelevant memory read paths that follow (e.g., spatially and/or temporally) the one or more sense amplifiers associated with the irrelevant memory read paths.

In any of the embodiments described above, the memory unit may include a column multiplexer (not shown).

In these embodiments, reduction unit 5030 may be coupled between the column multiplexer and the output port.

Additionally or alternatively, reduction unit 5030 may be embedded in the column multiplexer.

Additionally or alternatively, reduction unit 5030 may be coupled between the memory cells and the column multiplexer.

Reduction unit 5030 may comprise reduction subunits, which may be independently controllable. For example, different reduction subunits may be associated with different columns of the memory unit.

Although described above with respect to read operations and read circuitry, the embodiments above may similarly be applied to write operations and write circuitry.

For example, an integrated circuit according to the present disclosure may include a memory unit comprising memory cells, an output port, and write circuitry. In some embodiments, the memory unit may further include a processing unit, such as a processor subunit as described above. The write circuitry may include a reduction unit and a first group of in-memory write paths for outputting up to a first number of bits via the output port. The processing unit may be configured to send a write request to the memory unit to write a second number of bits to the memory unit. Additionally or alternatively, the write request may originate from an off-chip logic circuit (such as an accelerator, CPU, GPU, or the like). Reduction unit 5030 may be configured to control the memory write paths based on the first number of bits and the second number of bits during a write operation triggered by the write request.
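
The role of the reduction unit, for either reads or writes, can be summarized with the behavioral sketch below (Python, illustrative only): given the first number of bits that the paths could deliver and the positions of the second number of bits actually requested, it derives an enable state per path and keeps the remaining paths in a low-power state. The one-bit-per-path granularity and all names are assumptions; coarser, multi-bit granularity is equally possible, as noted below.

```python
# Illustrative reduction-unit behavior: enable only the read paths needed
# for the requested bits and keep the irrelevant paths in a low-power mode.

def reduction_unit(first_number_of_bits, requested_bit_positions):
    """Return per-path states for a read that needs only some of the bits."""
    states = []
    for path_index in range(first_number_of_bits):
        if path_index in requested_bit_positions:
            states.append('active')      # relevant path: bit line loaded,
                                         # sense amplifier enabled
        else:
            states.append('low_power')   # irrelevant path: not activated
    return states

def read_partial_word(word_bits, requested_bit_positions):
    states = reduction_unit(len(word_bits), set(requested_bit_positions))
    # Only the active paths convey data to the output port.
    return {i: word_bits[i] for i, s in enumerate(states) if s == 'active'}

if __name__ == "__main__":
    word = [1, 0, 1, 1, 0, 0, 1, 0]          # first number of bits = 8
    wanted = [0, 1, 2, 3]                    # second number of bits = 4
    print(reduction_unit(len(word), set(wanted)))
    print(read_partial_word(word, wanted))   # {0: 1, 1: 0, 2: 1, 3: 1}
```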

FIG. 51 illustrates a memory bank 5100 that includes an array 5111 of memory cells addressed using row addresses and column addresses (e.g., from an on-chip processor subunit or an off-chip logic circuit, such as an accelerator, CPU, GPU, or the like). As shown in FIG. 51, the memory cells are fed by bit lines (vertical) and word lines (horizontal; many word lines are omitted for simplicity). Furthermore, row decoder 5112 may be fed with a row address (e.g., from an on-chip processor subunit, an off-chip logic circuit, or a memory controller not shown in FIG. 51), column multiplexer 5113 may be fed with a column address (e.g., from an on-chip processor subunit, an off-chip logic circuit, or a memory controller not shown in FIG. 51), and column multiplexer 5113 may receive the outputs of up to an entire line and output up to one word via output bus 5115. In FIG. 51, output bus 5115 of column multiplexer 5113 is coupled to main I/O bus 5114. In other embodiments, output bus 5115 may be coupled to a processor subunit of the memory chip (e.g., as depicted in FIG. 7A) that sends the row and column addresses. For simplicity, the division of the memory bank into memory pads is not shown.

FIG. 52 illustrates a memory bank 5101. In FIG. 52, the memory bank is further illustrated as including processing-in-memory (PIM) logic 5116 having an input coupled to output bus 5115. PIM logic 5116 may generate addresses (e.g., comprising row addresses and column addresses) and output the addresses via PIM address bus 5118 to access the memory bank. PIM logic 5116 is an embodiment of a reduction unit (e.g., unit 5030) that also comprises a processing unit. PIM logic 5116 may control other circuits, not shown in FIG. 52, that assist in reducing power. PIM logic 5116 may further control the memory paths of the memory unit that includes memory bank 5101.

As explained above, in some situations the word length (e.g., the number of bit lines selected for one transfer) may be large.

In these situations, each word used for reading and/or writing may be associated with memory paths that may consume power at various stages of the read and/or write operation, for example:

a. Loading the bit lines: in order to load a bit line to the required value (either from the capacitor on the bit line in a read cycle, or to a new value to be written to the capacitor in a write cycle), the sense amplifiers located at the ends of the memory array need to be deactivated while ensuring that the capacitors holding the data do not discharge or charge (otherwise, the data stored on them would be destroyed); and

b. Moving the data from the sense amplifiers, via the column multiplexer that selects the bit lines, to the rest of the chip (to the I/O bus that carries data into and out of the chip, or to embedded logic that will use the data, such as a processor subunit on the same substrate as the memory).

To achieve power savings, the integrated circuits of the present disclosure may determine, at row activation time, that some portions of a word are irrelevant and then send disable signals to one or more sense amplifiers for the irrelevant portions of that word.

FIG. 53 illustrates a memory unit 5102 that includes a memory cell array 5111, a row decoder 5112, a column multiplexer 5113 coupled to an output bus 5115, and PIM logic 5116.

Memory unit 5102 also includes switches 5201 that enable or disable the passage of bits to column multiplexer 5113. Switches 5201 may comprise analog switches, or transistors configured to function as switches, configured to control the supply of voltage and/or the flow of current to portions of memory unit 5102. Sense amplifiers (not shown) may be located at the ends of the memory cell array, e.g., before (spatially and/or temporally) switches 5201.

Switches 5201 may be controlled by enable signals sent from PIM logic 5116 via bus 5117. When open, a switch is configured to disconnect a sense amplifier (not shown) of memory unit 5102 and therefore not to discharge or charge the bit line disconnected from that sense amplifier.

Switches 5201 and PIM logic 5116 may form a reduction unit (e.g., reduction unit 5030).

In yet another embodiment, PIM logic 5116 may send the enable signals to the sense amplifiers (e.g., when the sense amplifiers have enable inputs) rather than to switches 5201.

The bit lines may additionally or alternatively be disconnected at other points, e.g., not at the ends of the bit lines and after the sense amplifiers. For example, the bit lines may be disconnected before entering array 5111.

In these embodiments, power may also be saved on the data transfer from the sense amplifiers and the forwarding hardware (such as output bus 5115).

Other embodiments (which may save less power but may be easier to implement) focus on saving power in column multiplexer 5113 and shifting the losses from column multiplexer 5113 to the next-level circuitry. For example, as explained above, the next-level circuitry may comprise an I/O bus of the memory chip (such as bus 5115). In embodiments using in-memory processing, the next-level circuitry may additionally or alternatively comprise a processor subunit of the memory chip (such as PIM logic 5116).

FIG. 54A illustrates column multiplexer 5113 segmented into sections 5202. Each section 5202 of column multiplexer 5113 may be individually enabled or disabled by enable and/or disable signals sent from PIM logic 5116 via bus 5119. Column multiplexer 5113 may also be fed by address bus 5118.

The embodiment of FIG. 54A may provide finer control over the different portions of the output from column multiplexer 5113.

It should be noted that the control of the different memory paths may have different resolutions, for example ranging from one-bit resolution to multi-bit resolution. The former may be more effective in terms of power savings; the latter may be simpler to implement and require fewer control signals.
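
The coarser, section-level control of FIG. 54A can be sketched as below (Python, illustrative): a requested bit range is mapped onto enable signals for the multiplexer sections, showing the trade-off just mentioned: fewer control signals than one-bit resolution, but whole sections remain powered even when only part of a section is needed. The section width and section count are assumptions made for the sketch.

```python
# Illustrative section-level enable logic for a segmented column multiplexer:
# each section serves SECTION_BITS bit lines and has one enable signal.

SECTION_BITS = 32       # assumed section width
NUM_SECTIONS = 8        # assumed number of sections (256 bit lines total)

def section_enables(first_bit, last_bit):
    """Enable every section that overlaps the requested bit range."""
    enables = [False] * NUM_SECTIONS
    for section in range(NUM_SECTIONS):
        lo = section * SECTION_BITS
        hi = lo + SECTION_BITS - 1
        enables[section] = not (hi < first_bit or lo > last_bit)
    return enables

if __name__ == "__main__":
    # Requesting bits 40..71 touches sections 1 and 2 only.
    print(section_enables(40, 71))
```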

FIG. 54B illustrates an embodiment of a method 5130. For example, method 5130 may be implemented using any of the memory units described above with respect to FIG. 50, FIG. 51, FIG. 52, FIG. 53, or FIG. 54A.

Method 5130 may include steps 5132 and 5134.

Step 5132 may include sending, by a processing unit of the integrated circuit (e.g., PIM logic 5116), an access request to a memory unit of the integrated circuit to read a second number of bits from the memory unit. The memory unit may include memory cells (e.g., the memory cells of array 5111), an output port (e.g., output bus 5115), and read/write circuitry, which may include a reduction unit (e.g., reduction unit 5030) and a first group of memory read/write paths for outputting and/or inputting up to a first number of bits via the output port.

The access request may comprise a read request and/or a write request.

The memory input/output paths may comprise memory read paths, memory write paths, and/or paths used for both reading and writing.

Step 5134 may include responding to the access request.

For example, step 5134 may include controlling, by the reduction unit (e.g., unit 5030), the memory read/write paths based on the first number of bits and the second number of bits during an access operation triggered by the access request.

Step 5134 may also include any of the following operations and/or any combination of the following operations. Any of the operations listed below may be performed while responding to the access request, but may also be performed before and/or after responding to the access request.

Accordingly, step 5134 may include at least one of the following operations:

a. controlling irrelevant memory read paths when the second number is smaller than the first number, wherein the irrelevant memory read paths are associated with bits of the first number of bits that are not included in the second number of bits;

b. activating relevant memory read paths during the read operation, wherein the relevant memory read paths are configured to convey the second number of bits;

c. shutting down at least a portion of each of the irrelevant memory read paths during the read operation;

d. maintaining the irrelevant memory read paths in a low-power mode during the read operation;

e. controlling the bit lines of the irrelevant memory read paths;

f. loading the bit lines of the relevant memory read paths and maintaining the bit lines of the irrelevant memory read paths in a low-power mode;

g. loading the bit lines of the relevant memory read paths while keeping the bit lines of the irrelevant memory read paths deactivated;

h. utilizing portions of the relevant memory read paths during the read operation and maintaining a portion of each irrelevant memory read path in a low-power mode, wherein that portion is different from the bit line;

i. utilizing portions of the relevant memory read paths during the read operation and maintaining the sense amplifiers used for at least some of the irrelevant memory read paths in a low-power mode;

j. utilizing portions of the relevant memory read paths during the read operation and maintaining the sense amplifiers of at least some of the irrelevant memory read paths in a low-power mode; and

k. utilizing portions of the relevant memory read paths during the read operation and maintaining, in a low-power mode, the portions of the irrelevant memory read paths that follow the sense amplifiers of the irrelevant memory read paths.

The low-power mode or idle mode may comprise a mode in which the power consumption of a memory access path is lower than the power consumption of that memory access path when it is used for an access operation. In some embodiments, the low-power mode may even involve shutting down the memory access path. The low-power mode may additionally or alternatively comprise not activating the memory access path.

It should be noted that power reductions occurring during the bit-line phase may require that the relevance or irrelevance of the memory access paths be known before the word line is opened. Power reductions occurring elsewhere (e.g., in the column multiplexer) may instead allow the relevance or irrelevance of the memory access paths to be decided on each access.

Fast and Low-Power Activation and Fast-Access Memory

DRAM and other memory types (such as SRAM, flash memory, or the like) are often built from memory banks, which are typically constructed to allow row and column access schemes.

FIG. 55 illustrates an embodiment of a memory chip 5140 that includes multiple memory pads and associated logic (such as row and column decoders, depicted in FIG. 55 as RD and COL, respectively). In the embodiment of FIG. 55, the pads are grouped into groups and have word lines and bit lines passing through them. The memory pads and associated logic are labeled 5141, 5142, 5143, 5144, 5145, and 5146 in FIG. 55 and share at least one bus 5147.

Memory chip 5140 may be included in, may comprise, or may otherwise be included in the memory chips described in any of FIG. 3A, FIG. 3B, FIGS. 4-6, FIGS. 7A-7D, FIGS. 11-13, FIGS. 16-19, FIG. 22, or FIG. 23.

For example, in DRAM, the overhead associated with activating a new row (e.g., preparing a new line for access) is significant. Once a line has been activated (also referred to as opened), the data within that row is available for faster access. In DRAM, this access may occur in a random manner.

Two issues associated with activating a new line are power and time:

a. Power rises due to the current surge caused by accessing all the capacitors on the line together and by having to load the line (e.g., when opening a line spanning only a few memory banks, the power may reach several amperes); and

b. The time delay problem is mainly associated with the time it takes to load the row (word) line and then to load the bit (column) lines.

Some embodiments of the present disclosure may include systems and methods to reduce peak power consumption during line activation and to reduce line activation time. Some embodiments may sacrifice, at least to some degree, full random access within a line in order to reduce these power and time costs.

For example, in one embodiment, a memory unit may include a first memory pad, a second memory pad, and an activation unit configured to activate a first group of memory cells included in the first memory pad without activating a second group of memory cells included in the second memory pad. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.

Alternatively, the activation unit may be configured to activate the second group of memory cells included in the second memory pad without activating the first group of memory cells.

In some embodiments, the activation unit may be configured to activate the second group of memory cells after activating the first group of memory cells.

For example, the activation unit may be configured to activate the second group of memory cells after the expiration of a delay period that starts after the activation of the first group of memory cells has been completed.

Additionally or alternatively, the activation unit may be configured to activate the second group of memory cells based on the value of a signal developed on a first word-line segment coupled to the first group of memory cells.

In any of the embodiments described above, the activation unit may include an intermediate circuit disposed between the first word-line segment and a second word-line segment. In these embodiments, the first word-line segment may be coupled to the first memory cells and the second word-line segment may be coupled to the second memory cells. Non-limiting examples of intermediate circuits include switches, flip-flops, buffers, inverters, and the like, some of which are illustrated throughout FIGS. 56-61.

In some embodiments, the second memory cells may be coupled to the second word-line segment. In these embodiments, the second word-line segment may be coupled to a bypass word-line path that passes through at least the first memory pad. An example of such a bypass path is illustrated in FIG. 61.

The activation unit may comprise a control unit configured to control the supply of voltage (and/or current) to the first group of memory cells and to the second group of memory cells based on an activation signal from the word line associated with the single row.

In another embodiment, a memory unit may include a first memory pad, a second memory pad, and an activation unit configured to supply an activation signal to a first group of memory cells of the first memory pad and to delay the supply of the activation signal to a second group of memory cells of the second memory pad, at least until the activation of the first group of memory cells has been completed. The first group of memory cells and the second group of memory cells may belong to a single row of the memory unit.

For example, the activation unit may include a delay unit that may be configured to delay the supply of the activation signal.

Additionally or alternatively, the activation unit may include a comparator that may be configured to receive the activation signal at its input and to control the delay unit based on at least one characteristic of the activation signal.

In another embodiment, a memory unit may include a first memory pad, a second memory pad, and an isolation unit that may be configured to: isolate first memory cells of the first memory pad from second memory cells of the second memory pad during an initial activation period in which the first memory cells are activated; and couple the first memory cells to the second memory cells after the initial activation period. The first memory cells and the second memory cells may belong to a single row of the memory unit.

In the following embodiments, modifications to the memory pads themselves may not be required. In some examples, embodiments may rely on minor modifications to the memory bank.

The following figures depict mechanisms that shorten the word signal applied to a memory bank, thereby splitting the word line into a number of shorter portions.

In the following figures, various memory bank components are omitted for clarity.

FIGS. 56-61 illustrate portions of a memory bank (labeled 5140(1), 5140(2), 5140(3), 5140(4), 5140(5), and 5140(6), respectively), which portions include a row decoder 5112 and multiple memory pads grouped into different groups (such as 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), 5150(6), 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), 5151(6), 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6)).

布置成行的存储器垫可包括不同群组。Memory pads arranged in rows may include different groups.

图56至图59及图61说明存储器垫的九个群组,其中每一群组包括一对存储器垫。可使用任何数量的群组,每一群组具有任何数量的存储器垫。56-59 and 61 illustrate nine groups of memory pads, where each group includes a pair of memory pads. Any number of groups may be used, each group having any number of memory pads.

存储器垫5150(1)、5150(2)、5150(3)、5150(4)、5150(5)及5150(6)布置成行,共享多条存储器线,且分成三个群组:第一上部群组,其包括存储器垫5150(1)及5150(2);第二上部群组,其包括存储器垫5150(3)及5150(4);及第三上部群组,其包括存储器垫5150(5)及5150(6)。Memory pads 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), and 5150(6) are arranged in rows, sharing multiple memory lines, and divided into three groups: first upper group, which includes memory pads 5150(1) and 5150(2); a second upper group, which includes memory pads 5150(3) and 5150(4); and a third upper group, which includes memory pad 5150( 5) and 5150(6).

类似地,存储器垫5151(1)、5151(2)、5151(3)、5151(4)、5151(5)及5151(6)布置成行,共享多条存储器线且分成三个群组:第一中间群组,其包括存储器垫5151(1)及5151(2);第二中间群组,其包括存储器垫5151(3)及5151(4);及第三中间群组,其包括存储器垫5151(5)及5151(6)。Similarly, memory pads 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), and 5151(6) are arranged in rows, sharing multiple memory lines and divided into three groups: A middle group that includes memory pads 5151(1) and 5151(2); a second middle group that includes memory pads 5151(3) and 5151(4); and a third middle group that includes memory pads 5151(5) and 5151(6).

此外,存储器垫5152(1)、5152(2)、5152(3)、5152(4)、5152(5)及5152(6)布置成行,共享多条存储器线且分组成三个群组:第一下部群组,其包括存储器垫5152(1)及5152(2);第二下部群组,其包括存储器垫5152(3)及5152(4);及第三下部群组,其包括存储器垫5152(5)及5152(6)。任何数量的存储器垫可布置成行并共享存储器线,且可分成任何数量的群组。Additionally, memory pads 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6) are arranged in rows, sharing multiple memory lines and grouped into three groups: A lower group, which includes memory pads 5152(1) and 5152(2); a second lower group, which includes memory pads 5152(3) and 5152(4); and a third lower group, which includes memory Pads 5152(5) and 5152(6). Any number of memory pads can be arranged in rows and share memory lines, and can be divided into any number of groups.

例如,每个群组的存储器垫的数量可为一个、两个或可超过两个。For example, the number of memory pads per group may be one, two, or may exceed two.

As explained above, the activation circuit may be configured to activate one group of memory pads without activating another group of memory pads that shares the same memory line, or that is at least coupled to a different memory line segment having the same line address.

Figures 56 through 61 illustrate different examples of activation circuits. In some embodiments, at least a portion of the activation circuit (such as an intermediate circuit) may be located between groups of memory pads to allow one group of memory pads to be activated without activating another group of memory pads of the same row.

Figure 56 illustrates intermediate circuits, such as delay or isolation circuits 5153(1) through 5153(3), positioned between different lines of the first upper group of memory pads and different lines of the second upper group of memory pads.

Figure 56 also illustrates intermediate circuits, such as delay or isolation circuits 5154(1) through 5154(3), positioned between different lines of the second upper group of memory pads and different lines of the third upper group of memory pads. In addition, some delay or isolation circuits are positioned between the groups formed by the middle groups of memory pads, and some delay or isolation circuits are positioned between the groups formed by the lower groups of memory pads.

The delay or isolation circuits may delay or stop a word line signal from propagating from row decoder 5112 along a row to another group.

Figure 57 illustrates intermediate circuits, such as delay or isolation circuits, that include flip-flops (such as 5155(1) through 5155(3) and 5156(1) through 5156(3)).

When an activation signal is injected into a word line, one of the first pad groups (depending on the word line) is activated, while the other groups along that word line remain deactivated. The other groups may be activated on subsequent clock cycles. For example, a second one of the other groups may be activated on the next clock cycle, and a third one of the other groups may be activated after yet another clock cycle.
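The one-group-per-clock behavior described above can be illustrated with a minimal software sketch (a behavioral model only, not the circuit itself; the number of groups and the single-cycle delay per flip-flop stage are assumptions made purely for illustration):

```python
# Simplified behavioral model: an activation signal injected at the row decoder
# reaches the first pad group immediately and is handed to each following group
# one clock cycle later, as if a flip-flop were placed between adjacent groups.
def activated_groups(clock_cycle, num_groups=3):
    """Return which pad groups of the addressed row are active at a given cycle."""
    return [group_index <= clock_cycle for group_index in range(num_groups)]

for cycle in range(3):
    print(f"cycle {cycle}: {activated_groups(cycle)}")
# cycle 0: [True, False, False]  -> only the first group's word-line segment is charged
# cycle 1: [True, True, False]
# cycle 2: [True, True, True]
```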

The flip-flops may include D-type flip-flops or any other type of flip-flop. For simplicity, the clock fed to the D-type flip-flops is omitted from the figures.

Accordingly, an access to the first group may use power to charge only the portion of the word line associated with the first group, which is faster and requires less current than charging the entire word line.

More than one flip-flop may be used between groups of memory pads, thereby increasing the delay between the opened portions. Additionally or alternatively, embodiments may use a slower clock to increase the delay.

Furthermore, groups that remain activated may still hold values from the previously used line. For example, this approach may allow a new line segment to be activated while data of the previous line is still being accessed, thereby reducing the penalty associated with activating a new line.

Accordingly, some embodiments may activate a first group while allowing other groups of a previously activated line to remain active, with the signals of the bit lines not interfering with one another.

In addition, some embodiments may include switches and control signals. The control signals may be controlled by the bank controller, or by adding flip-flops between the control signals (for example, to produce the same timing effect as the mechanisms described above).

Figure 58 illustrates intermediate circuits, such as delay or isolation circuits, that are switches (such as 5157(1) through 5157(3) and 5158(1) through 5158(3)) positioned between one group and another group. A set of switches positioned between groups may be controlled by a dedicated control signal. In Figure 58, the control signal may be sent by row control unit 5160(1) and delayed by a sequence of one or more delay units (for example, units 5160(2) and 5160(3)) between the different sets of switches.

Figure 59 illustrates intermediate circuits, such as delay or isolation circuits, that are sequences of inverter gates or buffers (such as 5159(1) through 5159(3) and 5159'(1) through 5159'(3)) positioned between the groups of memory pads.

Instead of switches, buffers may be used between the groups of memory pads. A buffer may prevent the voltage drop along the word line between switches, an effect that sometimes occurs when a single-transistor structure is used.

Other embodiments may allow more random access and, by using area added to the memory bank, still provide very low activation power and time.

Figure 60 illustrates an embodiment that uses global word lines (such as 5152(1) through 5152(8)) positioned near the memory pads. These word lines may or may not pass through the memory pads and are coupled to the word lines within the memory pads via intermediate circuits, such as switches (for example, 5157(1) through 5157(8)). The switches may control which memory pad will be activated and allow the memory controller to activate only the relevant line portion at each point in time. Unlike the embodiments described above that use sequential activation of line portions, the embodiment of Figure 60 may provide better control.

Enable signals, such as row portion enable signals 5170(1) and 5170(2), may originate from logic that is not shown, such as a memory controller.

Figure 61 illustrates a global word line 5180 that passes through the memory pads and forms a bypass path for word line signals that may not need to be routed outside the pads. Accordingly, the embodiment shown in Figure 61 may reduce the area of the memory bank at the expense of some memory density.

In Figure 61, the global word line may pass through the memory pads without interruption and may not be connected to the memory cells. A local word line segment may be controlled by one of the switches and connected to the memory cells in the pad.

When the groups of memory pads provide a substantial partitioning of the word lines, the memory bank may effectively support full random access.

Another embodiment for slowing the spread of the activation signal along the word line may also save some wiring and logic, using switches and/or other buffering or isolation circuits between the memory pads without using dedicated enable signals and dedicated lines to convey the enable signals.

For example, comparators may be used to control the switches or other buffering or isolation circuits. When the level of the signal on the word line segment monitored by a comparator reaches a certain level, the comparator may activate the switch or other buffering or isolation circuit. For example, a certain level may indicate that the previous word line segment is fully loaded.
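For illustration, the comparator's role reduces to a threshold check on the monitored segment's level (the normalized threshold below is an assumed value for the sketch, not a specification of the embodiments):

```python
# Sketch: couple the next word-line segment only once the monitored segment's
# signal level indicates that it is fully loaded.
FULLY_LOADED_THRESHOLD = 0.9   # assumed normalized level; illustrative only

def should_activate_next_segment(monitored_level):
    return monitored_level >= FULLY_LOADED_THRESHOLD

print(should_activate_next_segment(0.5))   # False: previous segment still charging
print(should_activate_next_segment(0.95))  # True: close the switch to the next segment
```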

Figure 62 illustrates a method 5190 for operating a memory unit. For example, method 5190 may be implemented using any of the memory banks described above with respect to Figures 56 through 61.

Method 5190 may include steps 5192 and 5194.

Step 5192 may include activating, by an activation unit, a first group of memory cells included in a first memory pad of the memory unit, without activating a second group of memory cells included in a second memory pad of the memory unit. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.

Step 5194 may include activating, by the activation unit, the second group of memory cells, for example after step 5192.

Step 5194 may be performed while the first group of memory cells is being activated, after the first group of memory cells has been fully activated, after the expiration of a delay period that starts once the activation of the first group of memory cells has completed, after the first group of memory cells has been deactivated, and in similar situations.

The delay period may be fixed or adjustable. For example, the duration of the delay period may be based on an expected access pattern of the memory unit, or may be set independently of the expected access pattern. The delay period may range between less than one millisecond and more than one second.

In some embodiments, step 5194 may be initiated based on the value of a signal developed on a first word line segment coupled to the first group of memory cells. For example, when the value of the signal exceeds a first threshold, it may indicate that the first group of memory cells is fully activated.

Either of steps 5192 and 5194 may involve using an intermediate circuit (for example, an intermediate circuit of the activation unit) disposed between a first word line segment and a second word line segment. The first word line segment may be coupled to the first memory cells, and the second word line segment may be coupled to the second memory cells.

Examples of intermediate circuits are illustrated throughout Figures 56 through 61.

Steps 5192 and 5194 may also include controlling, by a control unit, the supply of an activation signal, from a word line associated with the single row, to the first group of memory cells and to the second group of memory cells.

Using Memory Parallelism to Accelerate Test Time and Using Vectors to Test Logic in Memory

Some embodiments of the present disclosure may use an in-chip test unit to accelerate testing.

In general, memory chip testing requires a significant amount of test time. Reducing test time reduces production costs and also allows more tests to be performed, resulting in more reliable products.

Figures 63 and 64 illustrate a tester 5200 and a chip (or a wafer of chips) 5210. Tester 5200 may include software that manages the testing. Tester 5200 may write different data sequences to the entire memory 5210 and then read the sequences back to identify where the failing bits of memory 5210 are located. Once identified, tester 5200 may issue a command to repair the bits, and if the problem can be repaired, tester 5200 may declare memory 5210 as passing. In other cases, some chips may be declared as failing.

Tester 5200 may write a test sequence and then read back the data to compare it with the expected results.

Figure 64 shows a test system having a tester 5200 and a complete wafer 5202 of chips (such as 5210) that are tested in parallel. For example, tester 5200 may be connected to each of the chips through a bus of wires.

As shown in Figure 64, tester 5200 must read and write all of the memory chips several times, and this data must pass through the external chip interface.

In addition, it may be beneficial to test both the logic and the memory banks of an integrated circuit, for example using programmable configuration information that may be provided using regular I/O operations.

The testing may also benefit from the presence of a test unit within the integrated circuit.

The test unit may belong to the integrated circuit and may analyze the test results and find faults, for example, in the logic (e.g., a processor subunit as depicted and described in Figure 7A) and/or in the memory (e.g., across multiple memory banks).

Memory testers are usually very simple and exchange test vectors with the integrated circuit according to a simple format. For example, there may be a write vector that includes pairs of an address of a memory entry to be written and a value to be written to that memory entry. There may also be a read vector that includes the addresses of the memory entries to be read. At least some of the addresses in the write vector may be the same as at least some of the addresses in the read vector. At least some other addresses of the write vector may differ from at least some other addresses of the read vector. When programmed, the memory tester may also receive an expected result vector, which may include the addresses of the memory entries to be read and the expected values to be read. The memory tester may compare the expected values with the values it reads.

According to an embodiment, the logic of an integrated circuit (with or without the memory of the integrated circuit), for example the processor subunits, may be tested by the memory tester using the same protocol/format. For example, some of the values in the write vector may be commands to be executed by the logic of the integrated circuit (and may, for example, involve computations and/or memory accesses). The memory tester may be programmed with a read vector and an expected result vector, which may include memory entry addresses, some of which store the expected values of the computations. Thus, the memory tester may be used to test the logic as well as the memory. Memory testers are usually simpler and cheaper than logic testers, and the proposed approach allows complex logic tests to be performed using a simple memory tester.
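As a rough illustration of this vector format (the addresses, values, and the FakeMemory stand-in below are hypothetical and serve only to make the sketch runnable; they are not part of the disclosed tester):

```python
# Hypothetical write/read/expected-result vectors in the simple format described above.
write_vector = [(0x00, 0xAA), (0x01, 0x55), (0x02, 0xFF)]     # (address, value to write)
read_vector = [0x00, 0x01, 0x02]                              # addresses to read back
expected_vector = [(0x00, 0xAA), (0x01, 0x55), (0x02, 0xFF)]  # (address, expected value)

class FakeMemory:
    """Stand-in for the memory under test, used only to make the sketch runnable."""
    def __init__(self):
        self.cells = {}
    def write(self, address, value):
        self.cells[address] = value
    def read(self, address):
        return self.cells.get(address, 0)

def run_memory_test(device, write_vector, read_vector, expected_vector):
    """Apply the write vector, read the read vector, and return mismatching addresses."""
    for address, value in write_vector:
        device.write(address, value)
    expected = dict(expected_vector)
    return [a for a in read_vector if device.read(a) != expected.get(a)]

print(run_memory_test(FakeMemory(), write_vector, read_vector, expected_vector))   # -> []
```

In the approach described above, some of the write-vector values would instead carry commands for the in-memory logic; the tester-side flow of writing, reading back, and comparing stays exactly the same.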

In some embodiments, the logic within the memory may enable testing of that logic by using only vectors (or other data structures), without using the more complex mechanisms that are common in logic testing (such as, for example, communicating with a controller via an interface to tell the logic which circuit to test).

Instead of using a test unit, the memory controller may be configured to receive instructions, included in the configuration information, to access memory entries, and to execute the access instructions and output the results.

Any of the integrated circuits illustrated in Figures 65 through 69 may perform the testing, even in the absence of a test unit or when a test unit that is present is unable to perform the testing.

Embodiments of the present disclosure may include methods and systems that use memory parallelism and internal chip bandwidth to accelerate and improve test time.

The methods and systems may be based on the memory chip testing itself (running the tests, reading the test results, and analyzing the results, as opposed to the tester doing so), saving the results, and ultimately allowing the tester to read the results (and, when needed, to program the memory chip back, for example to activate redundancy mechanisms). The testing may include testing the memory, or testing the memory banks and the logic (in the case of a computational memory having active logic portions to be tested, such as the case described above with respect to Figure 7).

In one embodiment, the method may include reading and writing data within the chip, such that the external bandwidth does not limit the testing.

In embodiments in which the memory chip includes processor subunits, each processor subunit may be programmed with a test code or configuration.

In embodiments in which the memory chip has processor subunits that cannot execute test code, or has no processor subunits but has memory controllers, the memory controllers may then be configured to read and write patterns (for example, programmed externally into the controller) and to mark the locations of faults (for example, where a value was written to a memory entry, the entry was read, and a value different from the written value was received) for further analysis.
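As a rough sketch of such a controller-driven test (the alternating pattern and the bank read/write interface are assumptions made only for illustration; `bank` stands in for a memory bank that the controller can access):

```python
# Sketch: an on-chip controller writes a pattern into its bank, reads it back,
# and records only the addresses that fail, for later retrieval by the tester.
def test_bank(bank, size):
    failing_addresses = []
    for address in range(size):
        pattern = 0xAA if address % 2 == 0 else 0x55   # alternating test pattern
        bank.write(address, pattern)
    for address in range(size):
        pattern = 0xAA if address % 2 == 0 else 0x55
        if bank.read(address) != pattern:
            failing_addresses.append(address)           # mark the fault location
    return failing_addresses                            # compact result read out over the interface
```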

It should be noted that testing a memory may require testing a very large number of bits, for example, testing every bit of the memory and verifying whether each tested bit functions. Furthermore, the memory tests may sometimes be repeated under different voltage and temperature conditions.

For some defects, one or more redundancy mechanisms may be activated (for example, by programming a flash memory or an OTP, or by blowing fuses). In addition, the logic and analog circuits of the memory chip (for example, controllers, regulators, I/Os) may also have to be tested.

In one embodiment, an integrated circuit may include a substrate, a memory array disposed on the substrate, a processing array disposed on the substrate, and an interface disposed on the substrate.

The integrated circuits described herein may be included in, may include, or may otherwise comprise the memory chip illustrated in any of Figures 3A, 3B, 4 through 6, 7A through 7D, 11 through 13, 16 through 19, 22, or 23.

Figures 65 through 69 illustrate various integrated circuits 5210 and testers 5200.

The integrated circuit is illustrated as including memory banks 5212, a chip interface 5211 (such as an I/O controller 5214 and a bus 5213 shared by the memory banks), and logic units (hereinafter "logic") 5215. Figure 66 illustrates a fuse interface 5216 and a bus 5217 coupled to the fuse interface and to the different memory banks.

Figures 65 through 70 also illustrate various steps in the test process, such as:

a. writing a test sequence 5221 (Figures 65, 67, 68, and 69);

b. reading back test results 5222 (Figures 67, 68, and 69);

c. writing an expected result sequence 5223 (Figure 65);

d. reading the failure addresses to repair 5224 (Figure 66); and

e. programming fuses 5225 (Figure 66).

Each memory bank may be coupled to and/or controlled by its own logic unit 5215. However, as described above, any allocation of memory banks to logic units 5215 may be provided. Accordingly, the number of logic units 5215 may differ from the number of memory banks, a logic unit may control more than a single memory bank or a portion of a memory bank, and the like.

Logic unit 5215 may include one or more test units. Figure 65 illustrates a test unit (TU) 5218 within logic 5215. A TU may be included in all or some of the logic units 5215. It should be noted that a test unit may be separate from the logic unit or integrated with the logic unit.

Figure 65 also illustrates a test pattern generator (labeled GEN) 5219 within TU 5218.

A test pattern generator may be included in all or some of the test units. For simplicity, the test pattern generators and test units are not illustrated in Figures 66 through 70, but may be included in those embodiments.

The memory array may include multiple memory banks. In addition, the processing array may include multiple test units. The multiple test units may be configured to test the multiple memory banks to provide test results. The interface may be configured to output information indicative of the test results to a device external to the integrated circuit.

The multiple test units may include at least one test pattern generator configured to generate at least one test pattern for use in testing one or more of the multiple memory banks. In some embodiments, as explained above, each of the multiple test units may include a test pattern generator configured to generate a test pattern for use by that particular test unit in testing at least one of the multiple memory banks. As indicated above, Figure 65 illustrates a test pattern generator (GEN) 5219 within a test unit. One or more, or even all, of the logic units may include a test pattern generator.

The at least one test pattern generator may be configured to receive, from the interface, instructions for generating the at least one test pattern. A test pattern may include the memory entries that should be accessed (for example, read and/or written) during the test, and/or the values to be written to those entries, and the like.

The interface may be configured to receive, from an external unit that may be located outside the integrated circuit, configuration information that includes instructions for generating the at least one test pattern.

The at least one test pattern generator may be configured to read, from the memory array, configuration information that includes instructions for generating the at least one test pattern.

In some embodiments, the configuration information may include a vector.

The interface may be configured to receive, from a device that may be located outside the integrated circuit, configuration information that may include instructions constituting at least one test pattern.

For example, the at least one test pattern may include memory array entries to be accessed during testing of the memory array.

The at least one test pattern may further include input data to be written to the memory array entries accessed during testing of the memory array.

Additionally or alternatively, the at least one test pattern may further include input data to be written to the memory array entries accessed during testing of the memory array, and expected values of output data to be read from the memory array entries accessed during testing of the memory array.

In some embodiments, the multiple test units may be configured to retrieve, from the memory array, test instructions that, once executed by the multiple test units, cause the multiple test units to test the memory array.

For example, the test instructions may be included in the configuration information.

The configuration information may include expected results of the testing of the memory array.

Additionally or alternatively, the configuration information may include values of the output data to be read from the memory array entries accessed during testing of the memory array.

Additionally or alternatively, the configuration information may include a vector.

In some embodiments, the multiple test units may be configured to retrieve, from the memory array, test instructions that, once executed by the multiple test units, cause the multiple test units to test the memory array and to test the processing array.

For example, the test instructions may be included in the configuration information.

The configuration information may include a vector.

Additionally or alternatively, the configuration information may include expected results of the testing of the memory array and the processing array.

In some embodiments, as described above, the multiple test units may lack a test pattern generator for generating the test patterns used during testing of the multiple memory banks.

In these embodiments, at least two of the multiple test units may be configured to test at least two of the multiple memory banks in parallel.

Alternatively, at least two of the multiple test units may be configured to test at least two of the multiple memory banks serially.

In some embodiments, the information indicative of the test results may include identifiers of failing memory array entries.

In some embodiments, the interface may be configured to retrieve, multiple times during testing of the memory array, partial test results obtained by the multiple test circuits.

In some embodiments, the integrated circuit may include an error correction unit configured to correct at least one error detected during testing of the memory array. For example, the error correction unit may be configured to repair memory errors using any suitable technique, for example, by disabling some memory words and replacing those words with redundant words.
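A minimal sketch of one possible redundancy approach of the kind mentioned above, in which failing word addresses are remapped to spare words (the class and its interface are hypothetical and only illustrate the idea, not the disclosed error correction unit):

```python
# Sketch: a simple redundancy scheme that remaps failing word addresses to spare words.
class RedundancyMap:
    def __init__(self, spare_addresses):
        self.spare_addresses = list(spare_addresses)   # pool of redundant words
        self.remap = {}                                # failing address -> spare address

    def repair(self, failing_address):
        if failing_address in self.remap:
            return True                                # already repaired
        if not self.spare_addresses:
            return False                               # no spares left: the bank stays failing
        self.remap[failing_address] = self.spare_addresses.pop()
        return True

    def resolve(self, address):
        return self.remap.get(address, address)        # applied on every subsequent access
```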

In any of the embodiments described above, the integrated circuit may be a memory chip.

For example, the integrated circuit may include a distributed processor, where the processing array may include multiple subunits of the distributed processor, as depicted in Figure 7A.

In these embodiments, each of the processor subunits may be associated with a corresponding, dedicated memory bank of the multiple memory banks.

In any of the embodiments described above, the information indicative of the test results may indicate a status of at least one memory bank. The status of a memory bank may be provided at one or more granularities: per memory word, per group of entries, or per complete memory bank.

Figures 65 and 66 illustrate four steps of the tester test phase.

In the first step, the tester writes (5221) the test sequence, and the logic units of the banks write the data to their memories. The logic may also be complex enough to receive commands from the tester and generate the sequences by itself (as explained below).

In the second step, the tester writes (5223) the expected results to the memory under test, and the logic units compare the expected results with the data read from their memory banks, saving a list of errors. If the logic is complex enough to generate the sequence of expected results by itself (as explained below), the writing of the expected results can be simplified.

In the third step, the tester reads (5224) the failure addresses from the logic units.

In the fourth step, the tester acts (5225) on the results and may recover the errors. For example, the tester may connect to a specific interface to program fuses in the memory, but any other mechanism that allows programming of the error correction mechanisms within the memory may also be used.

In these embodiments, the memory tester may use vectors to test the memory.

For example, each vector may be built from an input series and an output series.

The input series may include pairs of addresses and data to be written to the memory (in many embodiments, this series may be modeled as a formula that allows a program, such as a program executed by the logic units, to generate the series when needed).

In some embodiments, the test pattern generator may generate such vectors.

It should be noted that a vector is one example of a data structure, but some embodiments may use other data structures. The data structure may be compatible with other test data structures generated by a tester located outside the integrated circuit.

The output series may include address and data pairs containing the expected data to be read back from the memory (in some embodiments, this series may additionally or alternatively be generated by a program at run time, for example by the logic units).

A memory test typically includes executing a list of vectors, each vector writing data to the memory according to its input series, and then reading the data back according to its output series and comparing that data with its expected data.

In the event of a mismatch, the memory may be classified as failing, or, if the memory includes mechanisms for redundancy, the redundancy mechanisms may be activated so that the vector is tested again with the redundancy mechanisms activated.

In embodiments in which the memory includes processor subunits (as described above with respect to Figure 7A) or contains many memory controllers, the entire test may be handled by the logic units of the banks. Accordingly, the memory controllers or processor subunits may execute the tests.

The memory controller may be programmed from the tester, and the test results may be saved in the controller itself to be read later by the tester.

To configure and test the operation of the logic units, the tester may configure the logic units through memory accesses and verify that the results can be read through memory accesses.

For example, the input vector may contain a programming sequence for the logic unit, and the output vector may contain the expected results of this test. For example, if a logic unit, such as a processor subunit, includes a multiplier or an adder configured to perform a computation on two addresses in the memory, the input vector may include a set of commands that write data to the memory and a set of commands to the adder/multiplier logic. As long as the adder/multiplier results can be read back via the output vector, the results can be sent to the tester.
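As a sketch of such a logic test (the command encoding, addresses, and the software model of the adder below are invented solely for illustration and are not part of the disclosed circuits):

```python
# Sketch: the input vector both loads operands into memory and issues a command to the
# in-memory adder logic; the output vector then checks the result written back to memory.
OPERAND_A, OPERAND_B, RESULT = 0x10, 0x11, 0x12           # hypothetical addresses
COMMAND_REGISTER = 0xF0                                   # hypothetical command entry

input_vector = [
    (OPERAND_A, 3),                                       # write first operand
    (OPERAND_B, 4),                                       # write second operand
    (COMMAND_REGISTER, ("ADD", OPERAND_A, OPERAND_B, RESULT)),  # command for the logic unit
]
output_vector = [(RESULT, 7)]                             # expected sum, read back by the tester

def execute_logic_command(memory, command):
    op, a, b, dest = command
    if op == "ADD":
        memory[dest] = memory[a] + memory[b]              # the logic writes its result to memory

memory = {}
for address, value in input_vector:
    if address == COMMAND_REGISTER:
        execute_logic_command(memory, value)              # modeled in software for this sketch
    else:
        memory[address] = value

failures = [addr for addr, expected in output_vector if memory.get(addr) != expected]
print(failures)   # [] means the adder produced the expected result
```

The same flow generalizes to any logic unit whose configuration and results are reachable through memory reads and writes, which is what allows a simple memory tester to exercise it.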

The testing may also include loading the logic configuration from the memory and having the logic outputs sent to the memory.

In embodiments in which a logic unit loads its configuration from the memory (for example, if the logic is a memory controller), the logic unit may run code from the memory itself.

Accordingly, the input vector may include a program for the logic unit, and the program itself may test the various circuits within the logic unit.

Testing may therefore not be limited to receiving vectors in the format used by external testers.

If the commands loaded into the logic unit instruct the logic unit to write results back into the memory banks, the tester may read those results and compare them with the expected output series.

For example, the vector written to the memory may be, or may include, a test program for the logic unit (for example, the test may assume that the memory is valid; even if the memory is not valid, the written test program simply will not work and the test will fail, which is an acceptable outcome because the chip is invalid in any case), and/or may determine how the logic unit runs the code and writes the results back to the memory. Since all of the testing of the logic unit can be performed through the memory (for example, writing the logic test inputs to the memory and writing the test results back to that memory), the tester may run a simple vector test using an input sequence and an expected output sequence.

The logic configuration and the results may be accessed as read and/or write commands.

Figure 68 illustrates tester 5200 sending a write test sequence 5221, which is a vector.

Portions of the vector include test code 5232 split among the memory banks 5212 that are coupled to the logic 5215 of the processing array.

Each logic 5215 may execute the code 5232 stored in its associated memory bank, and the execution may include accessing one or more memory banks, performing calculations, and storing the results (for example, test results 5231) in the memory banks 5212.

The test results may be read back by tester 5200 (for example, read-back results 5222).

This may allow logic 5215 to be controlled by commands received by I/O controller 5214.

In Figure 68, I/O controller 5214 is connected to the memory banks and to the logic. In other embodiments, the logic may be connected between I/O controller 5214 and the memory banks.

Figure 70 illustrates a method 5300 for testing memory banks. For example, method 5300 may be implemented using any of the memory banks described above with respect to Figures 65 through 69.

Method 5300 may include steps 5302, 5310, and 5320. Step 5302 may include receiving a request to test memory banks of an integrated circuit. The integrated circuit may include a substrate, a memory array disposed on the substrate and including the memory banks, a processing array disposed on the substrate, and an interface disposed on the substrate. The processing array may include multiple test units, as described above.

In some embodiments, the request may include configuration information, one or more vectors, commands, and the like.

In these embodiments, the configuration information may include expected results of the testing of the memory array, instructions, data, values of output data to be read from memory array entries accessed during testing of the memory array, test patterns, and the like.

The test pattern may include at least one of: (i) memory array entries to be accessed during testing of the memory array, (ii) input data to be written to the memory array entries accessed during testing of the memory array, or (iii) expected values of output data to be read from the memory array entries accessed during testing of the memory array.

Step 5302 may include, and/or may be followed by, at least one of the following:

a. receiving, by at least one test pattern generator and from the interface, instructions for generating at least one test pattern;

b. receiving, through the interface and from an external unit outside the integrated circuit, configuration information that includes instructions for generating at least one test pattern;

c. reading, by at least one test pattern generator and from the memory array, configuration information that includes instructions for generating at least one test pattern;

d. receiving, through the interface and from an external unit outside the integrated circuit, configuration information that includes instructions constituting at least one test pattern;

e. retrieving, by the multiple test units and from the memory array, test instructions that, once executed by the multiple test units, cause the multiple test units to test the memory array; and

f. receiving, by the multiple test units and from the memory array, test instructions that, once executed by the multiple test units, cause the multiple test units to test the memory array and to test the processing array.

Step 5302 may be followed by step 5310. Step 5310 may include testing, by the multiple test units and in response to the request, the multiple memory banks to provide test results.

Method 5300 may also include receiving, through the interface and during testing of the memory array, partial test results obtained by the multiple test circuits.

Step 5310 may include, and/or may be followed by, at least one of the following:

a. generating, by one or more test pattern generators (for example, included in one, some, or all of the multiple test units), test patterns for use by the one or more test units in testing at least one of the multiple memory banks;

b. testing, by at least two of the multiple test units, at least two of the multiple memory banks in parallel;

c. testing, by at least two of the multiple test units, at least two of the multiple memory banks serially;

d. writing values to memory entries, reading the memory entries, and comparing the results; and

e. correcting, by an error correction unit, at least one error detected during testing of the memory array.

Step 5310 may be followed by step 5320. Step 5320 may include outputting, through the interface and externally to the integrated circuit, information indicative of the test results.

The information indicative of the test results may include identifiers of failing memory array entries. By not sending the read data for every memory entry, time may be saved.

Additionally or alternatively, the information indicative of the test results may indicate a status of at least one memory bank.

Accordingly, in some embodiments, the information indicative of the test results may be much smaller than the total size of the data units written to or read from the memory banks during the test, and may be much smaller than the input data that would have to be sent from a tester testing the memory without the assistance of the test units.

The integrated circuit under test may include a memory chip and/or a distributed processor as illustrated in any of the previous figures. For example, the integrated circuits described herein may be included in, may include, or may otherwise comprise the memory chip illustrated in any of Figures 3A, 3B, 4 through 6, 7A through 7D, 11 through 13, 16 through 19, 22, or 23.

Figure 71 illustrates an embodiment of a method 5350 for testing memory banks of an integrated circuit. For example, method 5350 may be implemented using any of the memory banks described above with respect to Figures 65 through 69.

Method 5350 may include steps 5352, 5355, and 5358. Step 5352 may include receiving, through an interface of the integrated circuit, configuration information that includes instructions. The integrated circuit that includes the interface may also include a substrate, a memory array that includes memory banks and is disposed on the substrate, a processing array disposed on the substrate, and the interface disposed on the substrate.

The configuration information may include expected results of the testing of the memory array, instructions, data, values of output data to be read from memory array entries accessed during testing of the memory array, test patterns, and the like.

Additionally or alternatively, the configuration information may include the instructions, the addresses of the memory entries to which the instructions are to be written, and input data, and may also include the addresses of the memory entries that are to receive the output values calculated during execution of the instructions.

The test pattern may include at least one of: (i) memory array entries to be accessed during testing of the memory array, (ii) input data to be written to the memory array entries accessed during testing of the memory array, or (iii) expected values of output data to be read from the memory array entries accessed during testing of the memory array.

Step 5352 may be followed by step 5355. Step 5355 may include executing the instructions by the processing array, the executing being performed by accessing the memory array, performing computational operations, and providing results.

Step 5355 may be followed by step 5358. Step 5358 may include outputting, through the interface and externally to the integrated circuit, information indicative of the results.

Cyber Security and Tamper Detection Techniques

Memory chips and/or processors may be targets for malicious actors and may be subject to various types of cyber attacks. In some cases, such attacks may attempt to alter data and/or code stored in one or more memory resources. Cyber attacks may be especially problematic relative to trained neural networks or other types of artificial intelligence (AI) models that depend on large amounts of data stored in memory. If the stored data is manipulated or even obscured, such manipulation can be harmful. For example, if the data on which data-intensive AI models rely is corrupted or obscured, autonomous vehicle systems that rely on those models to identify other vehicles, pedestrians, and the like may incorrectly assess the host vehicle's environment. As a result, accidents may occur. As AI models become increasingly common across a wide range of technologies, cyber attacks targeting the data associated with such models may cause significant damage.

In other cases, a cyber attack may involve one or more actors tampering, or attempting to tamper, with operating parameters associated with a processor or another type of integrated-circuit-based logic element. For example, processors are typically designed to operate within certain operating specifications. A cyber attack involving tampering may attempt to alter one or more of the operating parameters of a processor, memory unit, or other circuit such that the processor, memory unit, or other circuit operates outside of its designed operating specifications (for example, clock speed, bandwidth specifications, temperature limits, operating rates, etc.). Such tampering may cause the targeted hardware to fail.

Conventional techniques for defending against cyber attacks may include computer programs operating at the processor level (for example, anti-virus or anti-malware software). Other techniques may include the use of software-based firewalls associated with routers or other hardware. While these techniques may counter cyber attacks using software programs executed outside of the memory unit, there remains a need for additional or alternative techniques for efficiently protecting the data stored in memory units, particularly where the accuracy and availability of that data is critical to the operation of memory-intensive applications such as neural networks. Embodiments of the present disclosure may provide various memory-containing integrated circuit designs that resist cyber attacks against the memory.

Capturing sensitive information and commands into an integrated circuit in a secure manner (for example, during a boot process, when interfaces external to the chip/integrated circuit are not yet functional), and then maintaining the sensitive information and commands within the integrated circuit without exposing them outside the integrated circuit, may increase the security of the sensitive information and commands. CPUs and other types of processing units are vulnerable to cyber attacks, especially when those CPUs/processing units operate together with external memory. The disclosed embodiments, which include distributed processor subunits disposed on a memory chip among a memory array that includes multiple memory banks, may be less vulnerable to cyber attacks and tampering (for example, because the processing takes place within the memory chip). Including any combination of the disclosed security measures, discussed in more detail below, may further reduce the susceptibility of the disclosed embodiments to cyber attacks and/or tampering.

Figure 72A is a diagrammatic representation of an integrated circuit 7200 that includes a memory array and a processing array, consistent with embodiments of the present disclosure. For example, integrated circuit 7200 may include any of the distributed processor-on-memory-chip architectures (and features) described in the sections above and throughout the present disclosure. The memory array and the processing array may be formed on a common substrate, and in certain disclosed embodiments, integrated circuit 7200 may constitute a memory chip. For example, as discussed above, integrated circuit 7200 may include a memory chip that includes multiple memory banks and multiple processor subunits spatially distributed on the memory chip, where each of the multiple memory banks is associated with a dedicated one or more of the multiple processor subunits. In some cases, each processor subunit may be dedicated to one or more memory banks.

In some embodiments, the memory array may include multiple discrete memory banks 7210_1, 7210_2, ..., 7210_J1, 7210_Jn, as shown in Figure 72A. According to embodiments of the present disclosure, memory array 7210 may include one or more types of memory, including, for example, volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments of the present disclosure, memory banks 7210_1 through 7210_Jn may include multiple MOS memory structures.

As mentioned above, the processing array may include multiple processor subunits 7220_1 through 7220_K. In some embodiments, each of processor subunits 7220_1 through 7220_K may be associated with one or more discrete memory banks among the multiple discrete memory banks 7210_1 through 7210_Jn. Although the example embodiment of Figure 72A illustrates each processor subunit as associated with two discrete memory banks 7210, it should be appreciated that each processor subunit may be associated with any number of discrete, dedicated memory banks, and, vice versa, each memory bank may be associated with any number of processor subunits. According to embodiments of the present disclosure, the number of discrete memory banks included in the memory array of integrated circuit 7200 may be equal to, less than, or greater than the number of processor subunits included in the processing array of integrated circuit 7200.

Integrated circuit 7200 may further include multiple first buses 7260, consistent with embodiments of the present disclosure (and as described in the sections above). Each bus 7260 may connect a processor subunit 7220_k to a corresponding dedicated memory bank 7210_j. According to some embodiments of the present disclosure, integrated circuit 7200 may further include multiple second buses 7261. Each bus 7261 may connect a processor subunit 7220_k to another processor subunit 7220_k+1. As shown in Figure 72A, the multiple processor subunits 7220_1 through 7220_K may be connected to one another via buses 7261. Although Figure 72A illustrates the multiple processor subunits 7220_1 through 7220_K as forming a loop, connected in series via buses 7261, it should be appreciated that the processor subunits 7220 may be connected in any other manner. For example, in some cases, a particular processor subunit may not be connected to other processor subunits via a bus 7261. In other cases, a particular processor subunit may be connected to only one other processor subunit, and in still other cases, a particular processor subunit may be connected via one or more buses 7261 to two or more other processor subunits (for example, forming series connections, parallel connections, branch connections, etc.). It should be noted that the embodiments of integrated circuit 7200 described herein are merely exemplary. In some cases, integrated circuit 7200 may have different internal components and connections, and in other cases one or more of the internal components and described connections may be omitted (for example, depending on the needs of a particular application).

Referring back to FIG. 72A, integrated circuit 7200 may include one or more structures for implementing at least one security measure relative to integrated circuit 7200. In some cases, such a structure may be configured to detect cyber attacks that manipulate or obscure (or attempt to manipulate or obscure) data stored in one or more of the memory banks. In other cases, such a structure may be configured to detect tampering with operating parameters associated with integrated circuit 7200, or tampering with one or more hardware elements that directly or indirectly affect one or more operations associated with integrated circuit 7200 (whether included within integrated circuit 7200 or external to it).

In some cases, a controller 7240 may be included in integrated circuit 7200. Controller 7240 may be connected via one or more buses 7250 to, for example, one or more of processor subunits 7220_1 . . . 7220_K. Controller 7240 may also be connected to one or more of memory banks 7210_1 . . . 7210_Jn. Although the example embodiment of FIG. 72A shows a single controller 7240, it should be understood that controller 7240 may include multiple processor elements and/or logic circuits. In the disclosed embodiments, controller 7240 may be configured to implement at least one security measure with respect to at least one operation of integrated circuit 7200. Additionally, in the disclosed embodiments, if the at least one security measure is triggered, controller 7240 may be configured to take (or cause) one or more remedial actions.

According to some embodiments of the invention, the at least one security measure may include a controller-implemented process for locking access to certain aspects of integrated circuit 7200. Access locking involves having the controller prevent access (reads and/or writes) from outside the chip to certain regions of memory. Access control may be applied at address resolution, at the resolution of a portion of a memory bank, at memory bank resolution, and the like. In some cases, one or more physical locations in memory associated with integrated circuit 7200 may be locked (e.g., one or more memory banks of integrated circuit 7200, or any portion of one or more of the memory banks). In some embodiments, controller 7240 may lock access to certain portions of integrated circuit 7200 associated with the execution of an artificial intelligence model (or another type of software-based system). For example, in some embodiments, controller 7240 may lock access to the weights of a neural network model stored in memory associated with integrated circuit 7200. It should be noted that a software program (i.e., a model) may include three components: the program's input data, the program's code data, and the output data produced by executing the program. These components also apply to neural network models. During operation of such a model, input data may be generated and fed to the model, and execution of the model may produce output data to be read. However, the program code and data values (e.g., predetermined model weights, etc.) associated with executing the model on the received input data may remain fixed.
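
The fixed/variable split described above can be pictured as a simple region map kept by the controller. The following C sketch is purely illustrative: the region kinds, structure fields, and the notion of a lockability test are assumptions introduced here for explanation and are not defined by the disclosure.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of the three program components, plus
 * intermediate data, as discussed above. */
typedef enum {
    REGION_INPUT,        /* changeable: written from outside the chip      */
    REGION_OUTPUT,       /* changeable: read from outside the chip         */
    REGION_CODE,         /* fixed: program/model code                      */
    REGION_FIXED_DATA,   /* fixed: e.g., neural network weights            */
    REGION_INTERMEDIATE  /* produced during execution; may also be locked  */
} region_kind_t;

typedef struct {
    uint32_t      start;  /* first memory address of the region */
    uint32_t      end;    /* last memory address of the region  */
    region_kind_t kind;
} mem_region_t;

/* Regions holding code, fixed data, or intermediate results are candidates
 * for locking; input/output regions normally stay open so that external
 * components can supply inputs and read generated results. */
static bool region_lockable(const mem_region_t *r)
{
    return r->kind == REGION_CODE ||
           r->kind == REGION_FIXED_DATA ||
           r->kind == REGION_INTERMEDIATE;
}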

As described herein, locking may refer to operations by which the controller disallows, for example, read or write operations initiated from outside the chip/integrated circuit with respect to certain regions of memory. The controller, through which the I/O of the chip/integrated circuit passes, may lock not only an entire memory bank but also any range of memory addresses within a memory bank, from a single memory address up to an address range covering all addresses of the available memory banks (or any address range in between).

Because the memory locations associated with receiving input data and storing output data involve changing values and interaction with components external to integrated circuit 7200 (e.g., components supplying the input data or receiving the output data), locking access to those memory locations may be impractical in some situations. On the other hand, restricting access to the memory locations associated with the model's program code and fixed data values can be effective against certain types of cyber attacks. Thus, in some embodiments, the memory associated with program code and data values (e.g., memory not used for writing/receiving input data or for reading/providing output data) may be locked as a security measure. Restricting access may include locking certain memory locations such that no changes can be made to certain program code and/or data values (e.g., those associated with executing the model on the received input data). Additionally, memory regions associated with intermediate data (e.g., data generated during execution of the model) may also be locked against external access. Thus, while various computational logic (whether onboard integrated circuit 7200 or external to it) may provide data to, or receive data from, the memory locations associated with receiving input data or retrieving generated output data, such computational logic will not be able to access or modify the memory locations storing the program code and data values associated with program execution on the received input data.

In addition to locking memory locations on integrated circuit 7200 to provide a security measure, other security measures may be implemented by restricting access to certain computational logic elements (and the memory regions they access) that are configured to execute program code associated with a particular program or model. In some cases, such access restrictions may be applied to computational logic (and its associated memory regions) located on integrated circuit 7200 (e.g., computational memory, i.e., memory that includes computational capability, such as the distributed processors on a memory chip disclosed herein). Access may also be locked/restricted for computational logic (and associated memory locations) involved in any execution of code stored in a locked memory portion of integrated circuit 7200, or in any access to data values stored in a locked memory portion of integrated circuit 7200, regardless of whether that computational logic resides onboard integrated circuit 7200. Restricting access to the computational logic responsible for executing the program/model may further ensure that the code and data values associated with operating on the received input data remain protected from manipulation, obscuring, and the like.

The controller-implemented security measures, including locking or restricting access to hardware-based regions associated with certain portions of the memory array of integrated circuit 7200, may be realized in any suitable manner. In some embodiments, such locking may be implemented by adding or supplying commands to controller 7240 that are configured to cause controller 7240 to lock certain memory portions. In some embodiments, the hardware-based memory portions to be locked may be designated by particular memory addresses (e.g., addresses associated with any memory elements of memory banks 7210_1 . . . 7210_J2, etc.). In some embodiments, the locked regions of memory may remain fixed during program or model execution. In other cases, the locked regions may be configurable. That is, in some cases commands may be supplied to controller 7240 such that, during execution of a program or model, the locked region may change. For example, at particular times, certain memory locations may be added to the locked region of memory, or certain memory locations (e.g., previously locked memory locations) may be excluded from the locked region of memory.

Locking of certain memory locations may be accomplished in any suitable manner. In some cases, a record of the locked memory locations (e.g., a file, database, or other data structure storing and identifying the locked memory addresses) may be accessible to controller 7240, such that controller 7240 can determine whether a given memory request relates to a locked memory location. In some cases, controller 7240 maintains a database of locked addresses used to control access to certain memory locations. In other cases, the controller may have a table, or a set of one or more registers, that is configurable until locked and that may include fixed, predetermined values identifying the memory locations to be locked (e.g., the locations for which memory access from outside the chip should be restricted). For example, when a memory access is requested, controller 7240 may compare the memory address associated with the memory access request against the locked memory addresses. If the memory address associated with the memory access request is determined to be within the list of locked memory addresses, the memory access request (e.g., a read or a write operation) may be denied.
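
One way to picture the address comparison described above is a small table of locked ranges consulted on every access request originating outside the chip. The sketch below is only an assumption about how such a check might look in software terms; the range granularity (from a single address up to a whole bank) follows the description above, but the table size, names, and the decision to block reads and writes alike are illustrative choices.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start;  /* first locked address (inclusive) */
    uint32_t end;    /* last locked address (inclusive)  */
} locked_range_t;

/* Hypothetical table of locked ranges held by the controller; a range may
 * cover a single address or every address of an available memory bank. */
#define MAX_LOCKED_RANGES 64
static locked_range_t locked_table[MAX_LOCKED_RANGES];
static size_t locked_count;

typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind_t;

/* Returns true if an access originating outside the chip may proceed.
 * In this sketch both reads and writes to a locked range are refused,
 * matching the read-and/or-write locking described above. */
bool external_access_allowed(uint32_t addr, access_kind_t kind)
{
    (void)kind; /* reads and writes are treated the same here */
    for (size_t i = 0; i < locked_count; ++i) {
        if (addr >= locked_table[i].start && addr <= locked_table[i].end)
            return false; /* request targets a locked location: deny */
    }
    return true;
}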

As discussed above, the at least one security measure may include locking access to certain memory portions of memory array 7210 that are not used for receiving input data or for providing access to the generated output data. In some cases, the memory portions within the locked region may be adjusted. For example, a locked memory portion may be unlocked, and an unlocked memory portion may be locked. Any suitable method may be used to unlock a locked memory portion. For example, the implemented security measures may include requiring a complex password for unlocking one or more portions of a locked memory region.

An implemented security measure may be triggered upon detection of any action that runs counter to it. For example, an attempted access to a locked memory portion (whether a read or a write request) may trigger a security measure. Additionally, if an entered complex password (e.g., in an attempt to unlock a locked memory portion) does not match the predetermined complex password, a security measure may be triggered. In some cases, a security measure may be triggered if the correct complex password is not provided within an allowable threshold number of password entry attempts (e.g., 1, 2, 3, etc.).

Memory portions may be locked at any suitable time. For example, in some cases, memory portions may be locked at various times during program execution. In other cases, memory portions may be locked after startup or before program/model execution. For example, the memory addresses to be locked may be determined and identified in conjunction with the programming of the program/model code, or after the data to be accessed by the program/model has been generated and stored. In this way, vulnerability to attacks on memory array 7210 may be reduced or eliminated during periods at or after the start of program/model execution, after the data to be used by the program/model has been generated and stored, and so on.

Unlocking of locked memory may be accomplished by any suitable method and at any suitable time. As described above, a locked memory portion may be unlocked upon receipt of the correct complex password, passcode, or the like. In other cases, locked memory may be unlocked by restarting (by command or by powering off and on) or by erasing the entire memory array 7210. Additionally or alternatively, a release command sequence may be implemented to unlock one or more memory portions.

According to embodiments of the invention, and as described above, controller 7240 may be configured to control traffic to and from integrated circuit 7200, particularly traffic from sources external to integrated circuit 7200. For example, as shown in FIG. 72A, traffic between components external to integrated circuit 7200 and components internal to integrated circuit 7200 (e.g., memory array 7210 or processor subunits 7220) may be controlled by controller 7240. Such traffic may pass through controller 7240 or through one or more buses (e.g., 7250, 7260, or 7261) controlled or monitored by controller 7240.

According to some embodiments of the invention, integrated circuit 7200 may receive unchangeable data (e.g., fixed data such as model weights, coefficients, etc.) and certain commands (e.g., code, such as commands identifying the memory portions to be locked) during a boot process. Here, unchangeable data may refer to data that remains fixed during execution of a program or model and that may remain unchanged until a subsequent boot process. During program execution, integrated circuit 7200 may interact with changeable data, which may include input data to be processed and/or output data generated by processing associated with integrated circuit 7200. As discussed above, access to memory array 7210 or processing array 7220 may be restricted during program or model execution. For example, access may be limited to certain portions of memory array 7210, or to certain processor subunits, that are associated with the processing of, or interaction with, incoming input data to be written or generated output data to be read. During program or model execution, the memory portions containing the unchangeable data may be locked and thereby made inaccessible. In some embodiments, the unchangeable data and/or the commands associated with the memory portions to be locked may be included in any suitable data structure. For example, such data and/or commands may be made available to controller 7240 via one or more configuration files that may be accessed during or after the boot sequence.

Referring back to FIG. 72A, integrated circuit 7200 may further include a communication port 7230. As shown in FIG. 72A, controller 7240 may be coupled between communication port 7230 and bus 7250, which is shared among processing subunits 7220_1 through 7220_K. In some embodiments, communication port 7230 may be coupled, indirectly or directly, to a host computer 7270 associated with a host memory 7280, which may include, for example, non-volatile memory. In some embodiments, host computer 7270 may retrieve changeable data 7281 (e.g., input data to be used during execution of a program or model), unchangeable data 7282, and/or commands 7283 from its associated host memory 7280. Changeable data 7281, unchangeable data 7282, and commands 7283 may be uploaded from host computer 7270 to controller 7240 via communication port 7230 during the boot process.

FIG. 72B is a diagrammatic representation of memory regions within an integrated circuit consistent with embodiments of the invention. As shown, FIG. 72B depicts an example of the data structures included in host memory 7280.

Referring now to FIG. 73A, which shows another example of an integrated circuit consistent with embodiments of the invention, controller 7240 may include a cyber attack detector 7241 and a response module 7242. In some embodiments of the invention, controller 7240 may be configured to store or access access control rules 7243. According to some embodiments of the invention, access control rules 7243 may be included in a configuration file accessible to controller 7240. In some embodiments, access control rules 7243 may be uploaded to controller 7240 during the boot process. Access control rules 7243 may include information indicating the access rules associated with any of changeable data 7281, unchangeable data 7282, and commands 7283, and their corresponding memory locations. As explained above, access control rules 7243, or the configuration file, may include information identifying certain memory addresses within memory array 7210. In some embodiments, controller 7240 may be configured to provide a locking mechanism and/or function that locks various addresses of memory array 7210, such as addresses used to store commands or unchangeable data.
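
A configuration-file view of access control rules such as 7243 might pair each stored item (changeable data, unchangeable data, commands) with its address range and the operations permitted from outside the chip. The structure below is only a sketch under that assumption; the disclosure does not define a rule format, and the field names and example address ranges are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

typedef enum {
    ITEM_CHANGEABLE_DATA,  /* e.g., input/output buffers (cf. 7281)          */
    ITEM_UNCHANGEABLE_DATA,/* e.g., model weights, coefficients (cf. 7282)   */
    ITEM_COMMANDS          /* e.g., code / lock commands (cf. 7283)          */
} item_kind_t;

typedef struct {
    item_kind_t kind;
    uint32_t    start_addr;     /* first address covered by the rule         */
    uint32_t    end_addr;       /* last address covered by the rule          */
    bool        external_read;  /* reads from outside the chip allowed?      */
    bool        external_write; /* writes from outside the chip allowed?     */
} access_rule_t;

/* Example rule set of the kind that could be uploaded during the boot
 * process: unchangeable data and commands are neither readable nor writable
 * from outside the chip, while changeable data remains accessible. */
static const access_rule_t example_rules[] = {
    { ITEM_CHANGEABLE_DATA,   0x0000u, 0x0FFFu, true,  true  },
    { ITEM_UNCHANGEABLE_DATA, 0x1000u, 0x7FFFu, false, false },
    { ITEM_COMMANDS,          0x8000u, 0x8FFFu, false, false },
};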

Controller 7240 may be configured to enforce access control rules 7243, for example, to prevent unauthorized entities from altering the unchangeable data or the commands. In some embodiments, reads of the unchangeable data or the commands may also be prohibited according to access control rules 7243. According to some embodiments of the invention, controller 7240 may be configured to determine whether an access attempt has been made with respect to certain commands or to at least a portion of the unchangeable data. Controller 7240 (e.g., including cyber attack detector 7241) may compare the memory addresses associated with access requests against the memory addresses used for the unchangeable data and the commands, in order to detect whether an unauthorized access attempt has been made with respect to one or more locked memory locations. In this way, for example, cyber attack detector 7241 of controller 7240 may be configured to determine whether a suspected cyber attack has occurred, such as a request to alter one or more commands or to change or obscure unchangeable data associated with one or more locked memory portions. Response module 7242 may be configured to determine how to respond to a detected cyber attack and/or to implement a response to the detected cyber attack. For example, in some cases, in response to detecting an attack on data or commands in one or more locked memory locations, response module 7242 of controller 7240 may implement, or cause to be implemented, a response that may include, for example, halting one or more operations, such as the memory access operations associated with the detected attack. The response to a detected attack may also include halting one or more operations associated with execution of the program or model, returning a warning or other indicator of the attempted attack, asserting an alert line to the host, erasing the entire memory, and so on.

In addition to locking memory portions, other techniques for defending against cyber attacks may also be implemented to provide the described security measures associated with integrated circuit 7200. For example, in some embodiments, controller 7240 may be configured to duplicate a program or model across different memory locations and processor subunits associated with integrated circuit 7200. In this way, the program/model and its replica may be executed independently, and the results of the independent program/model executions may be compared. For example, the program/model may be duplicated in two memory banks 7210 and executed by different processor subunits 7220 of integrated circuit 7200. In other embodiments, the program/model may be duplicated in two different integrated circuits 7200. In either case, the results of the program/model executions may be compared to determine whether there is any difference between the duplicated program/model executions. A detected difference in execution results (e.g., intermediate execution results, final execution results, etc.) may indicate a cyber attack that has altered one or more aspects of the program/model or its associated data. In some embodiments, different memory banks 7210 and processor subunits 7220 may be assigned to execute the two duplicated models on the same input data. In some embodiments, intermediate results may be compared during execution of the two duplicated models on the same input data, and if there is a mismatch between the two intermediate results at the same stage, execution may be temporarily suspended as a potential remedial action. Where processor subunits of the same integrated circuit execute the two duplicated models, that integrated circuit may itself compare the results. This may be done without notifying any entity external to the integrated circuit of the execution of the two duplicated models. In other words, entities outside the chip are unaware that duplicated models are running in parallel on the integrated circuit.

FIG. 73B is a diagrammatic representation of a configuration for simultaneously executing duplicated models consistent with embodiments of the invention.

Although duplication of a single program/model is described as one example for detecting a possible cyber attack, any number of replicas (e.g., 1, 2, 3, or more than 3) may be used to detect a possible cyber attack. As the number of replicas and independent program/model executions increases, the confidence in detecting a cyber attack may also increase. A larger number of replicas may also reduce the potential success rate of a cyber attack, because it may be more difficult for an attacker to affect multiple program/model replicas. The number of program or model replicas may be determined at run time, to further increase the difficulty for a cyber attacker to successfully affect program or model execution.

In some embodiments, the duplicated models may differ from one another in one or more aspects. In such an example, the code associated with the two programs/models may be made different from each other, but the programs/models may be designed such that both return the same output. At least in this sense, the two programs/models may be considered replicas of each other. For example, two neural network models may have a different ordering of neurons within a layer relative to each other. Despite this change in the model program code, however, both neural network models may return the same output. Duplicating programs/models in this manner may make it more difficult for a cyber attacker to identify these valid replicas of the program or model to be compromised. As a result, duplicating a model/program may not only provide redundancy that minimizes the impact of a cyber attack, but may also enhance cyber attack detection (e.g., by exposing tampering or unauthorized access in which a cyber attacker alters one program/model or its data but fails to make the corresponding change to the program/model replica).

In many cases, the duplicated programs/models (including, in particular, duplicated programs/models exhibiting code differences) may be designed such that their outputs do not match exactly but instead constitute soft values (e.g., approximately equal output values) rather than exact fixed values. In such embodiments, the output results from two or more valid program/model replicas may be compared (e.g., using a dedicated module or by a host processor) to determine whether the difference between their output results (whether intermediate or final results) falls within a predetermined range. A difference in the output soft values that does not exceed a predetermined threshold or range may be treated as evidence of the absence of tampering, unauthorized access, and the like. On the other hand, if the difference in the output soft values exceeds the predetermined threshold or range, this difference may be treated as evidence that a cyber attack in the form of tampering, unauthorized access to memory, or the like has occurred. Under these conditions, the duplicated program/model security measure would be triggered, and one or more remedial actions may be taken (e.g., halting execution of the program or model, shutting down one or more operations of integrated circuit 7200, or operating in a safe mode with limited functionality, among many other actions).
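
Where the replicas return soft values rather than bit-identical results, the comparison reduces to checking whether corresponding outputs stay within a predetermined tolerance. The following sketch assumes floating-point outputs and a single absolute tolerance; both are illustrative choices made here, not requirements stated by the disclosure.

#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Compare outputs (intermediate or final) of two program/model replicas.
 * A difference within 'tolerance' is treated as consistent; a larger
 * difference is treated as possible evidence of tampering and would
 * trigger a remedial action (e.g., suspending execution). */
bool replica_outputs_consistent(const float *out_a,
                                const float *out_b,
                                size_t n,
                                float tolerance)
{
    for (size_t i = 0; i < n; ++i) {
        if (fabsf(out_a[i] - out_b[i]) > tolerance)
            return false; /* mismatch beyond the allowed soft-value range */
    }
    return true;
}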

The security measures associated with integrated circuit 7200 may also involve quantitative analysis of the data associated with an executing or executed program or model. For example, in some embodiments, controller 7240 may be configured to compute one or more checksum/hash/cyclic redundancy check (CRC)/parity values over the data stored in at least a portion of memory array 7210. The computed values may be compared with one or more predetermined values. If there is a deviation between the compared values, that deviation may be interpreted as evidence of tampering with the data stored in at least the portion of memory array 7210. In some embodiments, checksum/hash/CRC/parity bit values may be computed over all memory locations associated with memory array 7210 to identify changes in the data. In this example, the entire memory (or memory bank) in question may be read by, for example, host computer 7270 or a processor associated with integrated circuit 7200 for use in computing the checksum/hash/CRC/parity bit values. In other cases, checksum/hash/CRC/parity bit values may be computed over a predetermined subset of the memory locations associated with memory array 7210, to identify changes in the data associated with that subset of memory locations. In some embodiments, controller 7240 may be configured to compute checksum/hash/CRC/parity bit values associated with predetermined data paths (e.g., associated with memory access patterns), and the computed values may be compared with each other or with predetermined values to determine whether tampering or another form of cyber attack has occurred.
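
As a minimal illustration of the checksum idea, a CRC can be computed over a chosen span of memory and compared against a stored reference value. The CRC-32 polynomial and the byte-wise loop below are one common choice, used here only as an example; the disclosure leaves the particular checksum/hash/CRC/parity function open.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), computed over a span of
 * memory such as a locked region holding code or model weights. */
uint32_t crc32_region(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

/* A deviation from the expected value would be treated as evidence of
 * tampering with the monitored portion of the memory array. */
bool region_unmodified(const uint8_t *data, size_t len, uint32_t expected_crc)
{
    return crc32_region(data, len) == expected_crc;
}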

By protecting one or more predetermined values within integrated circuit 7200, or in locations accessible to integrated circuit 7200 (e.g., expected checksum/hash/CRC/parity bit values, expected differences between intermediate or final output results, expected difference ranges associated with certain values, etc.), integrated circuit 7200 may be made even more secure against cyber attacks. For example, in some embodiments, one or more predetermined values may be stored in registers of memory array 7210 and may be used during or after each run of the model (e.g., by controller 7240 of integrated circuit 7200) to evaluate intermediate or final output results, checksums, and so on. In some cases, a "save last result data" command may be used to update the register values so that the predetermined values are computed on the fly, and the computed values may be saved in a register or another memory location. In this way, valid output values may be used to update the predetermined comparison values after each full or partial execution of the program or model. This technique may increase the difficulty a cyber attacker experiences when attempting to modify or otherwise tamper with the one or more predetermined reference values designed to expose the cyber attacker's activity.

In operation, a CRC calculator may be used to track memory accesses. For example, such calculation circuits may be placed at the memory bank level, in the processor subunits, or at the controller, and each calculation circuit may be configured to accumulate into the CRC calculator as each memory access is made.

Referring now to FIG. 74A, which provides a diagrammatic representation of another embodiment of integrated circuit 7200, controller 7240 may include a tamper detector 7245 and a response module 7246. Similar to other disclosed embodiments, tamper detector 7245 may be configured to detect evidence of potential tampering attempts. According to some embodiments of the invention, the security measures associated with integrated circuit 7200 and implemented by controller 7240 may include, for example, comparing the actual program/model operational pattern with a predetermined/allowed operational pattern. If the actual program/model operational pattern differs from the predetermined/allowed operational pattern in one or more aspects, a security measure may be triggered. And if a security measure is triggered, response module 7246 of controller 7240 may be configured to implement one or more remedial measures in response.

FIG. 74C is a diagrammatic representation of detection elements that may be located at various points within a chip, according to exemplary disclosed embodiments. As described above, detection of cyber attacks and tampering may be performed using detection elements located at various points within the chip, as shown, for example, in FIG. 74C. For example, a given piece of code may be associated with an expected number of processing events within a given time period. The detector shown in FIG. 74C may count the number of events (monitored by an event counter) experienced by the system during a given time period (monitored by a time counter). If the number of events exceeds a predetermined threshold (e.g., the number of events expected during the predefined time period), tampering may be indicated. Such detectors may be included at multiple points in the system to monitor various types of events, as shown in FIG. 74C.
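
The event-counter/time-counter detector of FIG. 74C can be thought of as a windowed counter: events are accumulated, and when the time counter closes a window the event count is compared against the expected maximum. The sketch below is an abstract software analogue of such a hardware element, with hypothetical field names and a per-tick interface assumed for illustration.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t event_count;  /* events observed in the current window          */
    uint32_t time_count;   /* ticks elapsed in the current window            */
    uint32_t window_ticks; /* length of the monitoring window                */
    uint32_t max_events;   /* expected upper bound on events per window      */
} tamper_detector_t;

/* Called for every monitored event (e.g., a memory access). */
void detector_on_event(tamper_detector_t *d)
{
    d->event_count++;
}

/* Called once per clock tick. Returns true when the number of events in the
 * elapsed window exceeds the predetermined threshold, i.e., when tampering
 * is suspected. */
bool detector_on_tick(tamper_detector_t *d)
{
    if (++d->time_count < d->window_ticks)
        return false;                  /* window still open        */
    bool suspicious = d->event_count > d->max_events;
    d->event_count = 0;                /* start a new window       */
    d->time_count = 0;
    return suspicious;
}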

More specifically, in some embodiments, controller 7240 may be configured to store or access expected program/model operational patterns 7244. For example, in some cases, the operational pattern may be represented as a curve 7247 indicating allowed load-per-time patterns and prohibited or illegal load-per-time patterns. A tampering attempt may cause memory array 7210 or processing array 7220 to operate outside certain operational specifications. This may cause memory array 7210 or processing array 7220 to heat up or malfunction, and may enable changes to the data or code associated with memory array 7210 or processing array 7220. Such changes may cause the operational pattern to fall outside the allowed operational patterns indicated by curve 7247.

According to some embodiments of the invention, controller 7240 may be configured to monitor the operational patterns associated with memory array 7210 or processing array 7220. The operational patterns may be associated with the number of access requests, the types of access requests, the timing of access requests, and the like. Controller 7240 may be further configured to detect a tampering attack when the operational pattern differs from the allowable operational pattern.

It should be noted that the disclosed embodiments may be used to defend not only against cyber attacks but also against non-malicious errors in operation. For example, the disclosed embodiments may also be effective in protecting a system such as integrated circuit 7200 from errors caused by environmental factors, such as temperature or voltage changes or levels, especially where such levels fall outside the operating specifications of integrated circuit 7200.

In response to detecting a suspected cyber attack (e.g., as a response to a triggered security measure), any suitable remedial action may be implemented. For example, remedial actions may include halting one or more operations associated with program/model execution, operating one or more components associated with integrated circuit 7200 in a safe mode, locking one or more components of integrated circuit 7200 against additional input or access, and the like.

FIG. 74B provides a flowchart representation of a method 7450 of protecting an integrated circuit against tampering, according to exemplary disclosed embodiments. For example, step 7452 may include using a controller associated with the integrated circuit to implement at least one security measure with respect to an operation of the integrated circuit. At step 7454, if the at least one security measure is triggered, one or more remedial actions may be taken. The integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks.

In some embodiments, the disclosed security measures may be implemented in multiple memory chips, and at least one or more of the disclosed security mechanisms may be implemented for each memory chip/integrated circuit. In some cases, each memory chip/integrated circuit may implement the same security measures, but in other cases different memory chips/integrated circuits may implement different security measures (e.g., where a different security measure may be better suited to a certain type of operation associated with a particular integrated circuit). In some embodiments, more than one security measure may be implemented by the controller of a particular integrated circuit. For example, a particular integrated circuit may implement any number or type of the disclosed security measures. Additionally, a particular integrated circuit controller may be configured to implement multiple different remedial measures in response to a triggered security measure.

It should also be noted that two or more of the security mechanisms described above may be combined to improve security against cyber attacks or tampering attacks. Additionally, security measures may be implemented across different integrated circuits, and those integrated circuits may coordinate the implementation of the security measures. For example, model duplication may be performed within one memory chip or across different memory chips. In this example, results from one memory chip, or results from two or more memory chips, may be compared to detect a potential cyber attack or tampering attack. In some embodiments, the duplication security measures applied across multiple integrated circuits may include one or more of the disclosed access locking mechanisms, hash protection mechanisms, model duplication, program/model execution pattern analysis, or any combination of these or other disclosed embodiments.

Multiport Processor Subunits in DRAM

As described above, the presently disclosed embodiments may include a distributed processor memory chip that includes an array of processor subunits and an array of memory banks, where each of the processor subunits may be dedicated to at least one memory bank of the memory bank array. As discussed in the following sections, distributed processor memory chips may serve as the basis for scalable systems. That is, in some cases, a distributed processor memory chip may include one or more communication ports configured to transfer data from one distributed processor memory chip to another distributed processor memory chip. In this way, any desired number of distributed processor memory chips may be linked together (e.g., in series, in parallel, in a loop, or in any combination thereof) to form a scalable array of distributed processor memory chips. Such an array may provide a flexible solution for efficiently performing memory-intensive operations and for scaling the computational resources associated with the performance of memory-intensive operations. Because distributed processor memory chips may include clocks with different timing patterns, the presently disclosed embodiments include features for accurately controlling data transfers between distributed processor memory chips even in the presence of clock timing differences. Such embodiments may enable efficient data sharing among different distributed processor memory chips.

FIG. 75A is a diagrammatic representation of a scalable processor memory system including multiple distributed processor memory chips consistent with embodiments of the present invention. According to embodiments of the invention, the scalable processor memory system may include multiple distributed processor memory chips, such as a first distributed processor memory chip 7500, a second distributed processor memory chip 7500', and a third distributed processor memory chip 7500”. Each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500” may include any of the configurations and/or features associated with any of the distributed processor embodiments described in the present disclosure.

In some embodiments, each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500” may be implemented similarly to integrated circuit 7200 shown in FIG. 72A. As shown in FIG. 75A, the first distributed processor memory chip 7500 may include a memory array 7510, a processing array 7520, and a controller 7540. Memory array 7510, processing array 7520, and controller 7540 may be configured similarly to memory array 7210, processing array 7220, and controller 7240 of FIG. 72A.

According to embodiments of the invention, the first distributed processor memory chip 7500 may include a first communication port 7530. In some embodiments, the first communication port 7530 may be configured to communicate with one or more external entities. For example, communication port 7530 may be configured to establish a communication connection between distributed processor memory chip 7500 and an external entity other than another distributed processor memory chip (such as distributed processor memory chips 7500' and 7500”). For example, communication port 7530 may be coupled, indirectly or directly, to a host computer (e.g., as illustrated in FIG. 72A) or to any other computing device, communication module, or the like.

According to embodiments of the invention, the first distributed processor memory chip 7500 may further include one or more additional communication ports configured to communicate with other distributed processor memory chips, such as 7500' or 7500”. In some embodiments, the one or more additional communication ports may include a second communication port 7531 and a third communication port 7532, as shown in FIG. 75A. The second communication port 7531 may be configured to communicate with the second distributed processor memory chip 7500' and to establish a communication connection between the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500'. Similarly, the third communication port 7532 may be configured to communicate with the third distributed processor memory chip 7500” and to establish a communication connection between the first distributed processor memory chip 7500 and the third distributed processor memory chip 7500”. In some embodiments, the first distributed processor memory chip 7500 (and any of the memory chips disclosed herein) may include multiple communication ports, including any suitable number of communication ports (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 1000, etc.).

In some embodiments, the first communication port, the second communication port, and the third communication port are associated with corresponding buses. The corresponding bus may be a bus common to each of the first communication port, the second communication port, and the third communication port. In some embodiments, the corresponding buses associated with each of the first communication port, the second communication port, and the third communication port are all connected to the plurality of discrete memory banks. In some embodiments, the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip. In some embodiments, the second communication port is connected to at least one of a bus internal to the memory chip or at least one processor subunit included in the memory chip.

Although the disclosed distributed processor memory chip configuration has been explained with respect to the first distributed processor memory chip 7500, it should be noted that the second processor memory chip 7500' and the third processor memory chip 7500” may be configured similarly to the first distributed processor memory chip 7500. For example, the second distributed processor memory chip 7500' may also include a memory array 7510', a processing array 7520', a controller 7540', and/or multiple communication ports, such as ports 7530', 7531', and 7532'. Similarly, the third distributed processor memory chip 7500” may include a memory array 7510”, a processing array 7520”, a controller 7540”, and/or multiple communication ports, such as ports 7530”, 7531”, and 7532”. In some embodiments, the second communication port 7531' and the third communication port 7532' of the second distributed processor memory chip 7500' may be configured to communicate with the third distributed processor memory chip 7500” and the first distributed processor memory chip 7500, respectively. Similarly, the second communication port 7531” and the third communication port 7532” of the third distributed processor memory chip 7500” may be configured to communicate with the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500', respectively. This configuration similarity among distributed processor memory chips may facilitate scaling of computing systems based on the disclosed distributed processor memory chips. Additionally, the disclosed arrangement and configuration of the communication ports associated with each distributed processor memory chip may enable flexible arrangements of arrays of distributed processor memory chips (e.g., including series connections, parallel connections, ring connections, star connections, network connections, etc.).

According to embodiments of the invention, distributed processor memory chips, such as the first through third distributed processor memory chips 7500, 7500', and 7500”, may communicate with one another via a bus 7533. In some embodiments, bus 7533 may connect two communication ports of two different distributed processor memory chips. For example, the second communication port 7531 of the first processor memory chip 7500 may be connected via bus 7533 to the third communication port 7532' of the second processor memory chip 7500'. According to embodiments of the invention, distributed processor memory chips, such as the first through third distributed processor memory chips 7500, 7500', and 7500”, may also communicate with an external entity (e.g., a host computer) via a bus such as bus 7534. For example, the first communication port 7530 of the first distributed processor memory chip 7500 may be connected via bus 7534 to one or more external entities. The distributed processor memory chips may be connected to one another in various ways. In some cases, the distributed processor memory chips may exhibit series connectivity, in which each distributed processor memory chip is connected to a pair of adjacent distributed processor memory chips. In other cases, the distributed processor memory chips may exhibit a higher degree of connectivity, in which at least one distributed processor memory chip is connected to two or more other distributed processor memory chips. In some cases, all of the distributed processor memory chips among the multiple memory chips may be connected to all of the other distributed processor memory chips among the multiple memory chips.

As shown in FIG. 75A, bus 7533 (or any other bus associated with the embodiment of FIG. 75A) may be unidirectional. Although FIG. 75A illustrates bus 7533 as unidirectional with a certain data transfer flow (as indicated by the arrows shown in FIG. 75A), bus 7533 (or any other bus in FIG. 75A) may instead be implemented as a bidirectional bus. According to some embodiments of the invention, a bus connected between two distributed processor memory chips may be configured to have a higher communication speed than a bus connected between a distributed processor memory chip and an external entity. In some embodiments, communication between a distributed processor memory chip and an external entity may occur during limited periods of time, for example during preparation for execution (loading program code, input data, weight data, etc., from the host computer) or during periods in which results produced by execution of a neural network model, and the like, are output to the host computer. During execution of one or more programs associated with the distributed processors of chips 7500, 7500', and 7500” (e.g., during memory-intensive operations associated with artificial intelligence applications, etc.), communication between the distributed processor memory chips may proceed via buses 7533, 7533', and so on. In some embodiments, communication between a distributed processor memory chip and an external entity may occur less frequently than communication between two processor memory chips. Depending on the communication requirements and the embodiment, the bus between a distributed processor memory chip and an external entity may be configured to have a communication speed equal to, greater than, or less than the communication speed of the buses between distributed processor memory chips.

In some embodiments, as represented by FIG. 75A, multiple distributed processor memory chips, such as the first through third distributed processor memory chips 7500, 7500', and 7500", may be configured to communicate with one another. As mentioned, this capability may facilitate the assembly of scalable distributed processor memory chip systems. For example, the memory arrays 7510, 7510', and 7510" and the processing arrays 7520, 7520', and 7520" of the first through third processor memory chips 7500, 7500', and 7500" may be regarded as effectively belonging to a single distributed processor memory chip when linked by communication channels such as the buses shown in FIG. 75A.

According to embodiments of the present invention, communication among multiple distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities may be managed in any suitable manner. In some embodiments, these communications may be managed by processing resources such as processing array 7520 of distributed processor memory chip 7500. In some other embodiments, for example to relieve the processing resources provided by the array of distributed processors of the computational load imposed by communication management, controllers of the distributed processor memory chips, such as controllers 7540, 7540', 7540", etc., may be configured to manage communication between distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities. For example, each controller 7540, 7540', and 7540" of the first through third processor memory chips 7500, 7500', and 7500" may be configured to manage, relative to the other distributed processor memory chips, the communications associated with its corresponding distributed processor memory chip. In some embodiments, controllers 7540, 7540', and 7540" may be configured to control these communications via corresponding communication ports, such as ports 7531, 7531', 7531", 7532, 7532', and 7532".

Controllers 7540, 7540', and 7540" may also be configured to manage communication between distributed processor memory chips while taking into account timing differences that may exist among the distributed processor memory chips. For example, a distributed processor memory chip (e.g., 7500) may be fed by an internal clock that may differ from the clocks of other distributed processor memory chips (e.g., 7500' and 7500"). Accordingly, in some embodiments, controller 7540 may be configured to implement one or more strategies for accounting for different clock timing patterns among the distributed processor memory chips and to manage communication between the distributed processor memory chips by accounting for possible time skew between them.

For example, in some embodiments, the controller 7540 of the first distributed processor memory chip 7500 may be configured to enable data to be transferred from the first distributed processor memory chip 7500 to the second processor memory chip 7500' under certain conditions. In some cases, if one or more processor subunits of the first distributed processor memory chip 7500 are not ready to transfer data, controller 7540 may inhibit the data transfer. Alternatively or additionally, controller 7540 may inhibit the data transfer if the receiving processor subunit of the second distributed processor memory chip 7500' is not ready to receive data. In some cases, controller 7540 may initiate the transfer of data from a sending processor subunit (e.g., in chip 7500) to a receiving processor subunit (e.g., in chip 7500') after determining that the sending processor subunit is ready to send data and that the receiving processor subunit is ready to receive data. In other embodiments, controller 7540 may initiate a data transfer based solely on whether the sending processor subunit is ready to send data, particularly where the data can be buffered in controller 7540 or 7540', for example, until the receiving processor subunit is ready to receive the transferred data.

According to embodiments of the present invention, controller 7540 may be configured to determine whether one or more other timing constraints are satisfied in order to enable the data transfer. Such timing constraints may relate to the time difference between the transmission time from the sending processor subunit and the reception time at the receiving processor subunit, an access request from an external entity (e.g., a host computer) for the processed data, a refresh operation performed on a memory resource (e.g., a memory array) associated with the sending or receiving processor subunit, and others.
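As a non-limiting illustration, the gating conditions described above may be summarized in the short behavioral sketch below. The function and parameter names (e.g., sender_ready, refresh_pending, can_buffer) are assumptions introduced only for illustration and do not correspond to specific elements of the disclosed hardware.

```python
# Hypothetical behavioral model of the transfer-gating logic of a controller
# such as controller 7540; all names and conditions are illustrative only.

def decide_transfer(sender_ready, receiver_ready, refresh_pending,
                    host_access_pending, can_buffer):
    """Return one of 'start', 'buffer', or 'defer' for a pending transfer."""
    # Timing constraints: a refresh of an involved memory resource or a
    # pending host access request takes precedence over the transfer.
    if refresh_pending or host_access_pending:
        return "defer"
    # Both ends ready: the transfer may be initiated immediately.
    if sender_ready and receiver_ready:
        return "start"
    # Sender ready but receiver not: the controller may accept the data into
    # an internal buffer until the receiving processor subunit is ready.
    if sender_ready and can_buffer:
        return "buffer"
    # Otherwise the transfer is inhibited for now.
    return "defer"


if __name__ == "__main__":
    print(decide_transfer(True, False, False, False, can_buffer=True))   # buffer
    print(decide_transfer(True, True, False, False, can_buffer=False))   # start
```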

FIG. 75E is an example timing diagram consistent with embodiments of the present invention. FIG. 75E illustrates the following example.

In some embodiments, controller 7540 and other controllers associated with the distributed processor memory chips may be configured to use clock enable signals to manage data transfers between chips. For example, processing array 7520 may be fed by a clock. In some embodiments, whether one or more processor subunits respond to the supplied clock signal may be controlled, for example by controller 7540, using a clock enable signal (e.g., shown as "to CE" in FIG. 75A). Each processor subunit, e.g., 7520_1 through 7520_K, may execute program code, and the program code may include communication commands. According to some embodiments of the present invention, controller 7540 may control the timing of the communication commands by controlling the clock enable signals to processor subunits 7520_1 through 7520_K. For example, according to some embodiments, when a sending processor subunit (e.g., in the first processor memory chip 7500) is programmed to transmit data at a certain cycle (e.g., the 1000th clock cycle) and a receiving processor subunit (e.g., in the second processor memory chip 7500') is programmed to receive data at a certain cycle (e.g., the 1000th clock cycle), the controller 7540 of the first processor memory chip 7500 and the controller 7540' of the second processor memory chip 7500' may not allow the data transfer until both the sending processor subunit and the receiving processor subunit are ready to perform the data transfer. For example, controller 7540 may "inhibit" a data transfer from the sending processor subunit by supplying a certain clock enable signal (e.g., logic low) to the sending processor subunit in chip 7500, which prevents the sending processor subunit from sending data in response to the received clock signal. A certain clock enable signal may "freeze" the entire distributed processor memory chip or any portion of the distributed processor memory chip. On the other hand, controller 7540 may cause the sending processor subunit to initiate a data transfer by supplying the opposite clock enable signal (e.g., logic high) to the sending processor subunit, which causes the sending processor subunit to respond to the received clock signal. Similar operations, for example reception or non-reception by a receiving processor subunit in chip 7500', may be controlled using a clock enable signal issued by controller 7540'.
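The clock-enable gating described above may be pictured with the following minimal behavioral sketch, assuming a simple cycle-accurate software model. The class, method, and variable names are illustrative assumptions rather than elements of the disclosed circuits.

```python
# Minimal behavioral sketch of clock-enable gating; a subunit executes its
# program only on cycles for which its clock-enable (CE) input is asserted.

class ProcessorSubunit:
    def __init__(self, program):
        self.program = program      # list of instructions (strings)
        self.pc = 0                 # program counter

    def tick(self, clock_enable):
        """Advance one clock cycle; ignore the cycle if CE is low."""
        if not clock_enable or self.pc >= len(self.program):
            return None             # subunit is "frozen" this cycle
        instr = self.program[self.pc]
        self.pc += 1
        return instr


# CE is held low for two cycles while the peer chip is not ready, so the
# SEND instruction executes two cycles later than it otherwise would.
sender = ProcessorSubunit(["LOAD", "MAC", "SEND"])
ce_schedule = [True, True, False, False, True]
for cycle, ce in enumerate(ce_schedule):
    executed = sender.tick(ce)
    print(f"cycle {cycle}: CE={ce} executed={executed}")
```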

In some embodiments, the clock enable signal may be sent to all processor subunits (e.g., 7520_1 through 7520_K) in a processor memory chip (e.g., 7500). In general, the clock enable signal may have the effect of causing the processor subunits either to respond to their respective clock signals or to ignore those clock signals. For example, in some cases, when the clock enable signal is high (depending on the convention of the particular application), a processor subunit may respond to its clock signal and may execute one or more instructions according to the timing of its clock signal. On the other hand, when the clock enable signal is low, the processor subunit is prevented from responding to its clock signal, so that it does not execute instructions in response to the clock timing. In other words, the processor subunit may ignore the received clock signal when the clock enable signal is low.

Returning to the example of FIG. 75A, any of controllers 7540, 7540', or 7540" may be configured to use clock enable signals to control the operation of the respective distributed processor memory chip by causing one or more processor subunits in the respective array to respond, or not respond, to a received clock signal. In some embodiments, controller 7540, 7540', or 7540" may be configured to selectively advance program code execution, for example where such code relates to or includes data transfer operations and their timing. In some embodiments, controller 7540, 7540', or 7540" may be configured to use clock enable signals to control the timing of data transmission between two different distributed processor memory chips via any of communication ports 7531, 7531', 7531", 7532, 7532', 7532", and so forth. In some embodiments, controller 7540, 7540', or 7540" may be configured to use clock enable signals to control the timing of data reception between two different distributed processor memory chips via any of communication ports 7531, 7531', 7531", 7532, 7532', 7532", and so forth.

In some embodiments, the timing of data transfers between two different distributed processor memory chips may be configured based on a compilation optimization step. Compilation may allow building a process in which tasks can be efficiently assigned to processing subunits without being affected by the transfer delays on the bus connecting two different processor memory chips. The compilation may be performed by a compiler program in the host computer, or may be transferred to the host computer. Typically, the transfer delay on a bus between two different processor memory chips would create a data bottleneck for the processing subunits that need the data. The disclosed compilation may schedule data transfers in a manner that enables the processing units to receive data continuously even in the presence of unfavorable transfer delays on the bus.
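One assumed way of picturing such a compilation step is to issue each inter-chip transfer early enough that its bus latency is hidden behind computation. The sketch below is illustrative only; the fixed BUS_LATENCY value and the task representation are assumptions, not part of the disclosure.

```python
# Illustrative compile-time scheduling sketch: issue each inter-chip transfer
# BUS_LATENCY cycles before the cycle at which the consuming subunit needs it.

BUS_LATENCY = 12  # assumed transfer delay, in cycles, on the inter-chip bus

def schedule_transfers(consumers):
    """consumers: list of (data_id, needed_at_cycle) pairs.

    Returns a list of (data_id, issue_cycle) so that data arrives on time.
    """
    schedule = []
    for data_id, needed_at in consumers:
        issue_cycle = max(0, needed_at - BUS_LATENCY)
        schedule.append((data_id, issue_cycle))
    # Sort by issue time so a controller can walk the schedule in order.
    return sorted(schedule, key=lambda item: item[1])


if __name__ == "__main__":
    needs = [("weights_block_3", 40), ("activations_1", 15)]
    print(schedule_transfers(needs))
    # [('activations_1', 3), ('weights_block_3', 28)]
```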

Although the embodiment of FIG. 75A includes three ports per distributed processor memory chip (7500', 7500", 7500"'), any number of ports may be included in a distributed processor memory chip according to the disclosed embodiments. For example, in some cases, a distributed processor memory chip may include more or fewer ports. In the embodiment of FIG. 75B, each distributed processor memory chip (e.g., 7500A through 7500I) may be configured with multiple ports. These ports may be substantially identical to one another or may differ. In the example shown, each distributed processor memory chip includes five ports, including one host communication port 7570 and four chip ports 7572. Host communication port 7570 may be configured to enable communication (via bus 7534) between any of the distributed processors in the array (as shown in FIG. 75B) and, for example, a host computer located remotely relative to the array of distributed processor memory chips. Chip ports 7572 may be configured to enable communication between distributed processor memory chips via buses 7535.

Any number of distributed processor memory chips may be connected to one another. In the example shown in FIG. 75B, with four chip ports per distributed processor memory chip, an array may be realized in which each distributed processor memory chip is connected to two or more other distributed processor memory chips, and in some cases certain chips may be connected to four other distributed processor memory chips. Including more chip ports in the distributed processor memory chips enables greater interconnectivity between the distributed processor memory chips.

Additionally, although distributed processor memory chips 7500A through 7500I are shown in FIG. 75B as having two different types of communication ports 7570 and 7572, in some cases a single type of communication port may be included in each distributed processor memory chip. In other cases, more than two different types of communication ports may be included in one or more of the distributed processor memory chips. In the example of FIG. 75C, each of distributed processor memory chips 7500A' through 7500C' includes two (or more than two) communication ports 7570 of the same type. In this embodiment, communication ports 7570 may be configured to enable communication via bus 7534 with an external entity such as a host computer, and may also be configured to enable communication between distributed processor memory chips (e.g., between distributed processor memory chips 7500B' and 7500C') via bus 7535.

In some embodiments, ports provided on one or more distributed processor memory chips may be used to provide access to more than one host. For example, in the embodiment shown in FIG. 75D, the distributed processor memory chip includes two or more ports 7570. Ports 7570 may constitute host ports, chip ports, or a combination of host ports and chip ports. In the example shown, two ports 7570 and 7570' may enable two different hosts (e.g., host computers, computing elements, or other types of logic units) to access distributed processor memory chip 7500A via buses 7534 and 7534'. This embodiment may enable two (or more than two) different host computers to access distributed processor memory chip 7500A. In other embodiments, however, both buses 7534 and 7534' may be connected to the same host entity, for example where that host entity requires additional bandwidth or parallel access to one or more of the processor subunits/memory banks of distributed processor memory chip 7500A.

In some cases, as shown in FIG. 75D, more than one controller 7540 and 7540' may be used to control access to the distributed processor subunits/memory banks of distributed processor memory chip 7500A. In other cases, a single controller may be used to handle communications from one or more external host entities.

Additionally, one or more buses internal to distributed processor memory chip 7500A may enable parallel access to the distributed processor subunits/memory banks of distributed processor memory chip 7500A. For example, distributed processor memory chip 7500A may include a first bus 7580 and a second bus 7580' that enable parallel access to, for example, distributed processor subunits 7520_1 through 7520_6 and their corresponding dedicated memory banks 7510_1 through 7510_6. This configuration may allow two different locations in distributed processor memory chip 7500A to be accessed at the same time. Additionally, when not all ports are used at the same time, the ports may share hardware resources (e.g., a common bus and/or a common controller) within distributed processor memory chip 7500A and may be multiplexed (mux) onto the IO of that hardware.

In some embodiments, some of the computational units (e.g., processor subunits 7520_1 through 7520_6) may be connected to an additional port (7570') or controller, while others are not connected to an additional port or controller. However, data from a computational unit that is not connected to the additional port 7570' may pass through the internal grid of connections to a computational unit that is connected to port 7570'. In this way, communication may be carried out at both ports 7570 and 7570' simultaneously without adding an additional bus.

Although the communication ports (e.g., 7530 through 7532) and the controller (e.g., 7540) have been described as separate components, it should be understood that the communication ports and the controller (or any other components) may be implemented as an integrated unit according to embodiments of the present invention. FIG. 76 provides a diagrammatic representation of a distributed processor memory chip 7600 with an integrated controller and interface module, consistent with embodiments of the present invention. As shown in FIG. 76, processor memory chip 7600 may be implemented with an integrated controller and interface module 7547 configured to perform the functions of controller 7540 and communication ports 7530, 7531, and 7532 of FIG. 75. As shown in FIG. 76, controller and interface module 7547 is configured to communicate with a number of different entities, such as external entities and one or more distributed processor memory chips, via interfaces 7548_1 through 7548_N that are similar to the communication ports (e.g., 7530, 7531, and 7532). Controller and interface module 7547 may be further configured to control communication between distributed processor memory chips, or between distributed processor memory chip 7600 and an external entity such as a host computer. In some embodiments, controller and interface module 7547 may include communication interfaces 7548_1 through 7548_N configured to communicate in parallel with one or more other distributed processor memory chips and with external entities such as a host computer, a communication module, and the like.

FIG. 77 provides a flowchart representing a process for transferring data between distributed processor memory chips in the scalable processor memory system shown in FIG. 75, consistent with embodiments of the present invention. For purposes of illustration, the flow for transferring data will be described with reference to FIG. 75, and it is assumed that data is transferred from the first processor memory chip 7500 to the second processor memory chip 7500'.

At step S7710, a data transfer request may be received. It should be noted, however, and as described above, that in some embodiments a data transfer request may not be necessary. For example, in some cases the timing of a data transfer may be predetermined (e.g., by particular software code). In that case, the data transfer may proceed without a separate data transfer request. Step S7710 may be performed by, for example, controller 7540, among others. In some embodiments, the data transfer request may include a request to transfer data from one processor subunit of the first distributed processor memory chip 7500 to another processor subunit of the second distributed processor memory chip 7500'.

At step S7720, the data transfer timing may be determined. As mentioned, the data transfer timing may be predetermined and may depend on the execution order of a particular software program. Step S7720 may be performed by, for example, controller 7540, among others. In some embodiments, the data transfer timing may be determined by considering (1) whether the sending processor subunit is ready to transfer data and/or (2) whether the receiving processor subunit is ready to receive data. According to embodiments of the present invention, whether one or more other timing constraints are satisfied to enable the data transfer may also be considered. The one or more timing constraints may relate to the time difference between the transmission time from the sending processor subunit and the reception time at the receiving processor subunit, an access request from an external entity (e.g., a host computer) for the processed data, a refresh operation performed on a memory resource (e.g., a memory array) associated with the sending or receiving processor subunit, and the like. According to embodiments of the present invention, the processing subunits may be fed by a clock. In some embodiments, the clock supplied to a processing subunit may be controlled, for example, using a clock enable signal. According to some embodiments of the present invention, controller 7540 may control the timing of communication commands by controlling the clock enable signals to processor subunits 7520_1 through 7520_K.

At step S7730, the data transfer may be performed based on the data transfer timing determined at step S7720. Step S7730 may be performed by, for example, controller 7540, among others. For example, the sending processor subunit of the first processor memory chip 7500 may transfer data to the receiving processor subunit of the second processor memory chip 7500' according to the data transfer timing determined at step S7720.
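For clarity, steps S7710 through S7730 may be strung together in the following small procedural sketch. Everything here is a behavioral assumption (the request and result structures are invented for illustration); it merely mirrors the order of operations in FIG. 77.

```python
# Behavioral sketch of the flow of FIG. 77 (steps S7710-S7730); the data
# structures and helper names below are assumptions made for illustration.

def handle_transfer_request(request, sender_ready, receiver_ready,
                            current_cycle, earliest_cycle=0):
    # S7710: receive (or implicitly derive) a data transfer request.
    src, dst, payload = request["src"], request["dst"], request["data"]

    # S7720: determine the transfer timing from readiness and constraints.
    if not (sender_ready and receiver_ready):
        return {"status": "waiting"}
    transfer_cycle = max(current_cycle, earliest_cycle)

    # S7730: perform the transfer at the determined timing.
    return {"status": "done", "from": src, "to": dst,
            "cycle": transfer_cycle, "data": payload}


if __name__ == "__main__":
    req = {"src": "chip7500/sub_2", "dst": "chip7500'/sub_5", "data": [1, 2, 3]}
    print(handle_transfer_request(req, True, True, current_cycle=1000))
```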

The disclosed architecture may be applicable to a variety of applications. For example, in some cases, the above architecture may facilitate sharing data among different distributed processor memory chips, such as weights, neuron values, or partial neuron values associated with neural networks, particularly large neural networks. Additionally, certain operations, such as SUM, AVG, and the like, may require data from multiple different distributed processor memory chips. In such cases, the disclosed architecture may facilitate sharing that data among multiple distributed processor memory chips. Still further, for example, the disclosed architecture may facilitate sharing records among distributed processor memory chips to support join operations for queries.

It should also be noted that although the above embodiments have been described with respect to distributed processor memory chips, the same principles and techniques may be applied, for example, to conventional memory chips that do not include distributed processor subunits. For example, in some cases, multiple memory chips may be combined together into a multi-port memory chip to form an array of memory chips that does not even have an array of processor subunits. In another embodiment, multiple memory chips may be combined together to form an array of connected memories, effectively providing the host with one larger memory comprising multiple memory chips.

The internal connection of a port may be to the main bus or to one of the internal processor subunits included in the processing array.

In-memory zero detection

Some embodiments of the present invention relate to a memory unit for detecting zero values stored in one or more particular addresses of multiple memory banks. This zero-value detection feature of the disclosed memory unit may be useful for reducing the power consumption of a computing system and, additionally or alternatively, may also reduce the processing time required to retrieve zero values from memory. This feature may be especially relevant in systems in which a large amount of the data read is in fact zero-valued and is also used in computational operations such as multiplication, addition, subtraction, and more. For such operations, retrieving a zero value from memory may be unnecessary (e.g., the product of a zero value and any other value is zero), and the computational circuitry may use the fact that one of the operands is zero to compute the result more efficiently in both time and energy. In such cases, detection of the presence of a zero value may be used instead of accessing the memory and retrieving the zero value from it.

Throughout this section, the disclosed embodiments are described with respect to read functionality. However, it should be noted that the disclosed architectures and techniques are equally applicable to zero-value write operations or, where other values may occur more often, to operations on other particular predetermined non-zero values.

In the disclosed embodiments, instead of retrieving a zero value from memory, when such a value is detected at a particular address, the memory unit may return a zero-value indicator to one or more circuits external to the memory unit (e.g., one or more processors, CPUs, etc., located outside the memory unit). The zero value is a multi-bit zero value (e.g., a zero-valued byte, a zero-valued word, a multi-bit zero value smaller than one byte or larger than one byte, and the like). The zero-value indicator is a 1-bit signal indicating a zero value stored in memory, so transmitting a 1-bit indication signal is beneficial compared with transmitting the n data bits stored in memory. The transmitted zero indication may reduce the energy consumed for the transfer to 1/n and may speed up computations, for example where multiplication operations are involved in computing inputs against neuron weights, in convolutions, in applying kernels to input data, and in many other computations associated with trained neural networks, artificial intelligence, and a wide range of other types of operations. To provide this functionality, the disclosed memory unit may include one or more zero-value detection logic units that can detect the presence of a zero value at a particular location in the memory, prevent retrieval of the zero value (e.g., via a read command), and instead cause the zero-value indicator to be transmitted to circuitry external to the memory unit (e.g., using one or more control lines of the memory, one or more buses associated with the memory unit, etc.). Zero-value detection may be performed at the memory pad level, at the bank level, at the sub-bank level, at the chip level, and so forth.

It should be noted that although the disclosed embodiments are described with respect to delivering the zero indicator to a location external to the memory chip, the disclosed embodiments and features may also provide significant benefits in systems in which processing can take place inside the memory chip. For example, in embodiments such as the distributed processor memory chips disclosed herein, processing may be performed on data in the various memory banks by corresponding processor subunits. In many situations, such as neural network execution or data analytics in which the associated data may include many zeros, the disclosed techniques may speed up processing and/or reduce the power consumption associated with the processing performed by the processor subunits in the distributed processor memory chip.

FIG. 78A illustrates a system 7800 for detecting, at the chip level, zero values stored in one or more particular addresses of multiple memory banks implemented in a memory chip 7810, consistent with embodiments of the present invention. System 7800 may include memory chip 7810 and a host 7820. Memory chip 7810 may include multiple control units, and each control unit may have a dedicated memory bank. For example, a control unit may be operatively connected to a dedicated memory bank.

In some cases, for example with respect to the distributed processor memory chips disclosed herein, which include processor subunits spatially distributed among an array of memory banks, processing within the memory chip may involve memory accesses (whether reads or writes). Even in the case of processing inside the memory chip, the disclosed techniques for detecting a zero value associated with a read or write command may allow an internal processor unit or subunit to forgo transferring the actual zero value. Rather, in response to zero-value detection and transmission of a zero-value indicator (e.g., to one or more internal processing subunits), the distributed processor memory chip may save the energy that would otherwise have been used to transfer zero data values within the memory chip.

In another example, each of memory chip 7810 and host 7820 may include an input/output (IO) to enable communication between memory chip 7810 and host 7820. Each IO may be coupled to a zero-value indicator line 7830A and a bus 7840A. Zero-value indicator line 7830A may transmit a zero-value indicator from memory chip 7810 to host 7820, where the zero-value indicator may include a 1-bit signal generated by memory chip 7810 upon detecting a zero value stored at the particular address of the memory bank requested by host 7820. Upon receiving the zero-value indicator via zero-value indicator line 7830A, host 7820 may perform one or more predefined actions associated with the zero-value indicator. For example, if host 7820 requests memory chip 7810 to retrieve an operand for a multiplication, host 7820 may compute the multiplication more efficiently because host 7820 will know from the received zero-value indicator (without receiving the actual memory value) that one of the operands is zero. Host 7820 may also provide instructions, data, and other input to memory chip 7810 via bus 7840, and may read output from memory chip 7810. Upon receiving a communication from host 7820, memory chip 7810 may retrieve the data associated with the received communication and transmit the retrieved data to host 7820 via bus 7840.
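One way to picture the benefit on the host side is the short sketch below: when the memory returns a one-bit zero indicator instead of an n-bit operand, a multiplication can be short-circuited. The read interface shown (a tuple of an indicator and an optional value) is an assumed abstraction for illustration only, not the actual chip protocol.

```python
# Illustrative host-side use of a zero-value indicator; the (is_zero, value)
# read interface is an assumed abstraction, not the disclosed signaling.

def fetch_operand(memory, address):
    """Return (is_zero, value); value is None when only the indicator is sent."""
    value = memory.get(address, 0)
    return (True, None) if value == 0 else (False, value)


def multiply_from_memory(memory, addr_a, addr_b):
    zero_a, a = fetch_operand(memory, addr_a)
    zero_b, b = fetch_operand(memory, addr_b)
    if zero_a or zero_b:
        # No need to move the full operands or run the multiplier.
        return 0
    return a * b


if __name__ == "__main__":
    mem = {0x10: 0, 0x11: 7}
    print(multiply_from_memory(mem, 0x10, 0x11))   # 0, computed without a multiply
```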

In some embodiments, the host may send a zero-value indicator, rather than a zero data value, to the memory chip. In this way, the memory chip (e.g., a controller disposed on the memory chip) may store or refresh a zero value in memory without having to receive the zero data value. This update may occur based on receipt of the zero-value indicator (e.g., as part of a write command).

FIG. 78B illustrates a memory chip 7810 for detecting, at the memory bank level, zero values stored in one or more particular addresses of multiple memory banks 7811A through 7811B, consistent with embodiments of the present invention. Memory chip 7810 may include multiple memory banks 7811A through 7811B and an IO bus 7812. Although FIG. 78B depicts two memory banks 7811A through 7811B implemented in memory chip 7810, memory chip 7810 may include any number of memory banks.

IO bus 7812 may be configured to transfer data to/from an external chip (e.g., host 7820 of FIG. 78A) via bus 7840B. Bus 7840B may function similarly to bus 7840A of FIG. 78A. IO bus 7812 may also transmit a zero-value indicator via zero-value indicator line 7830B, where zero-value indicator line 7830B may function similarly to zero-value indicator line 7830A of FIG. 78A. IO bus 7812 may also be configured to communicate with memory banks 7811A through 7811B via an internal zero-value indicator line 7831 and a bus 7841. IO bus 7812 may transfer data received from the external chip to one of memory banks 7811A through 7811B. For example, IO bus 7812 may transfer, via bus 7841, data containing an instruction to read data stored at a particular address of memory bank 7811A. A multiplexer may be included between IO bus 7812 and memory banks 7811A through 7811B and may be connected by internal zero-value indicator line 7831 and bus 7841A. The multiplexer may be configured to transfer data received from IO bus 7812 to a particular memory bank, and may be further configured to transfer data or a zero-value indicator received from that particular memory bank to IO bus 7812.

In some cases, the host entity may be configured to receive only conventional data transfers and may not be equipped to interpret or respond to the disclosed zero-value indicator. In that case, the disclosed embodiments (e.g., a controller, chip IO, etc.) may regenerate a zero value on the data lines to the host IO in place of the zero-value indicator signal, and may thereby still save data transfer power inside the chip.

Each of memory banks 7811A through 7811B may include a control unit. The control unit may detect a zero value stored at the requested address of the memory bank. Upon detecting the stored zero value, the control unit may generate a zero-value indicator and transmit the generated zero-value indicator to IO bus 7812 via internal zero-value indicator line 7831, from which the zero-value indicator is further transmitted to the external chip via zero-value indicator line 7830B.

FIG. 79 illustrates a memory bank 7911 for detecting, at the memory pad level, zero values stored in one or more particular addresses of multiple memory pads, consistent with embodiments of the present invention. In some embodiments, memory bank 7911 may be organized into memory pads 7912A through 7912B, each of which may be independently controlled and independently accessed. Memory bank 7911 may include memory pad controllers 7913A through 7913B, which may include zero-value detection logic units 7914A through 7914B. Each of memory pad controllers 7913A through 7913B may allow reading from and writing to locations on memory pads 7912A through 7912B. Memory bank 7911 may further include read-disable components, local sense amplifiers 7915A through 7915B, and/or a global sense amplifier 7916.

Each of memory pads 7912A through 7912B may include multiple memory cells. Each of the multiple memory cells may store one bit of binary information. For example, any of the memory cells may individually store a zero value. If all of the memory cells in a particular memory pad store zero values, the zero value may be associated with the entire memory pad.

Each of memory pad controllers 7913A through 7913B may be configured to access its dedicated memory pad and to read data stored in, or write data to, the dedicated memory pad.

In some embodiments, a zero-value detection logic unit 7914A or 7914B may be implemented in memory bank 7911. One or more zero-value detection logic units 7914A through 7914B may be associated with a memory bank, a memory sub-bank, a memory pad, or a set of one or more memory cells. Zero-value detection logic unit 7914A or 7914B may detect that the particular address requested (e.g., memory pad 7912A or 7912B) stores a zero value. This detection may be performed in a number of ways.

A first method may include using a digital comparator against zero. The digital comparator may be configured to take two numbers in binary form as inputs and to determine whether the first number (the retrieved data) is equal to the second number (zero). If the digital comparator determines that the two numbers are equal, the zero-value detection logic unit may generate a zero-value indicator. The zero-value indicator may be a 1-bit signal and may disable the amplifiers (e.g., local sense amplifiers 7915A through 7915B), transmitters, and buffers that would otherwise send the data bits to the next level (e.g., IO bus 7812 of FIG. 78B). The zero-value indicator may be further transmitted to global sense amplifier 7916 via zero-value indicator line 7931A or 7931B, although in some cases the global sense amplifier may be bypassed.
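In digital terms, this first method reduces to checking that every bit of the retrieved word is zero (equivalently, a NOR across the bits). A minimal sketch, with an assumed word width introduced only for illustration, follows.

```python
# Minimal sketch of the first (digital comparator) zero-detection method:
# a word is zero when the OR of all of its bits is 0. Word width is assumed.

WORD_BITS = 16  # assumed word width for illustration

def is_zero_word(bits):
    """bits: iterable of 0/1 values read from one word of the memory pad."""
    ored = 0
    for b in bits:
        ored |= b
    return ored == 0     # True -> drive the 1-bit zero-value indicator


if __name__ == "__main__":
    word = [0] * WORD_BITS
    print(is_zero_word(word))   # True: assert zero indicator, skip the data bus
    word[3] = 1
    print(is_zero_word(word))   # False: forward the data bits as usual
```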

A second method for zero detection may include using an analog comparator. An analog comparator may function similarly to a digital comparator, except that the voltages of two analog inputs are used for the comparison. For example, all of the bits may be sensed, and the comparator may act as a logical OR function across the signals.

A third method for zero-value detection may include using the signals transferred from local sense amplifiers 7915A through 7915B to global sense amplifier 7916, where global sense amplifier 7916 is configured to sense whether any of its inputs is high (non-zero) and to use that logic signal to control the next level of amplifiers. Local sense amplifiers 7915A through 7915B and global sense amplifier 7916 may include multiple transistors configured to sense low-power signals from the multiple memory banks, and these amplifiers amplify the small voltage swings to higher voltage levels so that the data stored in the multiple memory banks can be interpreted by at least one controller, such as memory pad controller 7913A or 7913B. For example, memory cells may be arranged in rows and columns on memory bank 7911. Each line may be attached to every memory cell in a row. The lines running along the rows are called word lines, which are activated by selectively applying a voltage to a word line. The lines running along the columns are called bit lines, and two such complementary bit lines may be attached to a sense amplifier at the edge of the memory array. The number of sense amplifiers may correspond to the number of bit lines (columns) on memory bank 7911. To read a bit from a particular memory cell, the word line along the cell's row is turned on, activating all of the memory cells in that row. The stored value (0 or 1) of each cell is then available on the bit line associated with that particular cell. At the ends of the two complementary bit lines, the sense amplifier may amplify the small voltage to a normal logic level. The bit from the desired cell can then be latched from the cell's sense amplifier into a buffer and placed on the output bus.

A fourth method for zero-value detection may include using, for each word saved to memory, one extra bit that is set at write time if the value is 0, and using that extra bit when reading the data to know whether the data is zero. This method can avoid writing all of the zeros to memory, thus saving even more energy.
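This fourth method amounts to keeping one flag bit per stored word, set at write time and consulted at read time. A behavioral sketch of that bookkeeping, with invented structure names, is given below.

```python
# Behavioral sketch of the fourth method: one extra "zero flag" bit per word,
# written at store time and consulted at read time. Names are illustrative.

class ZeroTaggedMemory:
    def __init__(self, num_words):
        self.words = [0] * num_words
        self.zero_flag = [True] * num_words   # extra bit per word

    def write(self, addr, value):
        self.zero_flag[addr] = (value == 0)
        if value != 0:                 # skip writing all-zero data, saving energy
            self.words[addr] = value

    def read(self, addr):
        """Return (zero_indicator, value-or-None)."""
        if self.zero_flag[addr]:
            return True, None          # only the 1-bit indicator leaves the array
        return False, self.words[addr]


if __name__ == "__main__":
    mem = ZeroTaggedMemory(8)
    mem.write(2, 0)
    mem.write(3, 42)
    print(mem.read(2))   # (True, None)
    print(mem.read(3))   # (False, 42)
```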

As described above and throughout this disclosure, some embodiments may include a memory unit (such as memory unit 7800) that includes multiple processor subunits. These processor subunits may be spatially distributed on a single substrate (e.g., the substrate of a memory chip such as memory unit 7800). Furthermore, each of the multiple processor subunits may be dedicated to a corresponding memory bank among the multiple memory banks of memory unit 7800, and these memory banks dedicated to corresponding processor subunits may also be spatially distributed on the substrate. In some embodiments, memory unit 7800 may be associated with a particular task (e.g., performing one or more operations associated with running a neural network, etc.), and each of the processor subunits of memory unit 7800 may be responsible for performing a portion of that task. For example, each processor subunit may be equipped with instructions that may include data handling and memory operations, arithmetic and logical operations, and the like. In some cases, the zero-value detection logic may be configured to provide the zero-value indicator to one or more of the described processor subunits spatially distributed on memory unit 7800.

Reference is now made to FIG. 80, which is a flowchart illustrating an exemplary method 8000 of detecting a zero value stored at a particular address of multiple memory banks, consistent with embodiments of the present invention. Method 8000 may be performed by a memory chip (e.g., memory chip 7810 of FIG. 78B). In particular, a controller of the memory unit (e.g., controller 7913A of FIG. 79) and a zero-value detection logic unit (e.g., zero-value detection logic unit 7914A) may perform method 8000.

In step 8010, a read or write operation may be initiated by any suitable technique. In some cases, the controller may receive a request to read data stored at a particular address of multiple discrete memory banks (e.g., the memory banks depicted in FIG. 78). The controller may be configured to control at least one aspect of read/write operations with respect to the multiple discrete memory banks.

In step 8020, one or more zero-value detection circuits may be used to detect the presence of a zero value associated with the read or write command. For example, a zero-value detection logic unit (e.g., zero-value detection logic unit 7830 of FIG. 78) may detect a zero value associated with the particular address associated with the read or write.

In step 8030, the controller may transmit a zero-value indicator to one or more circuits external to the memory unit in response to the zero-value detection performed by the zero-value detection logic unit in step 8020. For example, the zero-value detection logic may detect that the requested address stores a zero value, and an indication that the value is zero may be transmitted to an entity (e.g., one or more circuits) external to the memory chip (or within the memory chip, for example where the disclosed distributed processor memory chip includes processor subunits distributed among the array of memory banks). If no zero value associated with the read or write command is detected, the controller may transmit the data value rather than the zero-value indicator. In some embodiments, the one or more circuits to which the zero-value indicator is returned may be internal to the memory unit.
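Putting steps 8010 through 8030 together, the controller-side behavior may be summarized in a few lines. The sketch below is a simplified assumption of that flow; the detection function and the delivery callback are invented for illustration only.

```python
# Simplified sketch of method 8000 from the controller's point of view.
# The detection function and delivery callback are illustrative assumptions.

def zero_detect(value):
    # Step 8020: zero-value detection associated with the accessed address.
    return value == 0


def handle_read(memory_bank, address, deliver):
    # Step 8010: a read operation is initiated for `address`.
    value = memory_bank[address]
    if zero_detect(value):
        # Step 8030: transmit only the 1-bit zero-value indicator.
        deliver({"zero_indicator": 1})
    else:
        # Otherwise transmit the data value itself.
        deliver({"zero_indicator": 0, "data": value})


if __name__ == "__main__":
    bank = {0x00: 0, 0x01: 123}
    handle_read(bank, 0x00, print)   # {'zero_indicator': 1}
    handle_read(bank, 0x01, print)   # {'zero_indicator': 0, 'data': 123}
```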

Although the disclosed embodiments have been described with respect to zero-value detection, the same principles and techniques would apply to detecting other memory values (e.g., 1, etc.). In some cases, in addition to the zero-value indicator, the detection logic may also return one or more indicators of other values (e.g., 1, etc.) associated with a read or write command, and these indicators may be returned/transmitted whenever any value corresponding to a value indicator is detected. In some cases, these values may be adjusted by a user (e.g., via updating one or more registers). Such updates may be particularly useful where characteristics of the data set are known and it is understood (e.g., by the user) that certain values may be more prevalent in the data than others. In that case, one, two, three, or more than three value indicators may be associated with the most prevalent data values associated with the data set.

Compensating for the DRAM activation penalty

In certain types of memory (e.g., DRAM), memory cells may be arranged in arrays within memory banks, and the values contained in the memory cells may be accessed and retrieved (read) one row of memory cells of the array at a time. This read process may involve first opening (activating) a line (or row) of memory cells so that the data values stored by those memory cells become available. Next, the values of the memory cells in the open row may be sensed simultaneously, and a column address may be used to cycle through individual memory cell values or groups of memory cell values (i.e., words) and connect each memory cell value to an external data bus for reading. These processes take time. In some cases, opening a memory row for reading may require 32 cycles of computation time, and reading the values from the open row may require another 32 cycles. If the next row to be read is opened only after the read operation of the currently open row has completed, significant latency can result. In this example, during the 32 cycles required to open the next row, no data is read, and reading each row effectively requires a total of 64 cycles rather than only the 32 cycles needed to traverse the row data. Conventional memory systems do not allow a second row in the same bank to be opened while a first row is being read or written. To save latency, the next row to be opened may therefore be located in a different bank, or in a special bank designed for dual-row access, as discussed in further detail below. Before the next row is opened, the current row may be fully sampled into flip-flops or latches, and while the next row is being opened, all processing is performed on the flip-flops/latches. If the next predicted row is in the same bank (and none of the above conditions applies), the latency may be unavoidable and the system may need to wait. These mechanisms are relevant both to standard memories and, in particular, to memory processing devices.

Embodiments disclosed herein may reduce this latency by, for example, predicting the next memory row to be opened before the read operation of the currently open memory row has completed. That is, if the next row to be opened can be predicted, the process for opening the next row can begin before the read operation of the current row has completed. Depending on when the next-row prediction is made in the process, the latency associated with opening the next row may be reduced from 32 cycles (in the particular example described above) to fewer than 32 cycles. In one particular example, if the opening of the next row is predicted 20 cycles in advance, the additional latency is only 12 cycles. In another example, if the opening of the next row is predicted 32 cycles in advance, there is no latency at all. As a result, instead of requiring a total of 64 cycles to serially open and read each row, the effective time to read each row can be reduced by opening the next row while the current row is being read.
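The timing benefit can be made concrete with a short calculation, using the 32-cycle figures from the example above; these figures are illustrative and not fixed properties of any particular DRAM.

```python
# Illustrative latency arithmetic for overlapping row activation with reads,
# using the example figures above (32 cycles to open a row, 32 to read it).

OPEN_CYCLES = 32
READ_CYCLES = 32

def effective_cycles_per_row(prediction_lead):
    """prediction_lead: how many cycles before the end of the current read
    the next row's activation is started (0 = purely serial operation)."""
    hidden = min(prediction_lead, OPEN_CYCLES)
    return READ_CYCLES + (OPEN_CYCLES - hidden)


if __name__ == "__main__":
    for lead in (0, 20, 32):
        print(f"lead={lead:2d} cycles -> {effective_cycles_per_row(lead)} cycles/row")
    # lead= 0 -> 64, lead=20 -> 44, lead=32 -> 32
```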

The following mechanisms may require the current row and the predicted row to be located in different banks, but they may also be used if a bank exists that supports activating one row while another row is simultaneously being worked on.

In the disclosed embodiments, next-row prediction may be performed using various techniques (discussed in more detail below). For example, the next-row prediction may be based on pattern recognition, on a predetermined row-access schedule, on the output of an artificial intelligence model (e.g., a trained neural network that analyzes row accesses and predicts the next row to be opened), or on any other suitable prediction technique. In some embodiments, 100% successful prediction may be achieved by using a delayed address generator, a formula, or other methods described below. Prediction may include building a system with the ability to predict the next row to be opened sufficiently in advance of when that row needs to be accessed. In some cases, next-row prediction may be performed by a next-row predictor, which may be implemented in various ways, for example as a predicted address generator provided alongside the generator of the current address used for reading from and/or writing to a memory row. The entity that generates addresses for accessing memory (for reads or writes) may be based on any logic circuit or on a controller/CPU executing software instructions. The predicted address generator may include a pattern learning model that observes the accessed rows, identifies one or more patterns associated with the accesses (e.g., sequential row accesses, accesses to every second row, accesses to every third row, etc.), and estimates the next row to be accessed based on the observed patterns. In other examples, the predicted address generator may include a unit that applies a formula/algorithm to predict the next row to be accessed. In still other embodiments, the predicted address generator may include a trained neural network that outputs the predicted next row to be accessed (including one or more addresses associated with the predicted row) based on inputs such as the current row address being accessed and the last 2, 3, 4, or more than 4 accessed addresses/rows. Using any of the described predicted address generators to predict the next memory row to be accessed can significantly reduce the latency associated with memory accesses. The described predicted address/row generators are applicable in any system that accesses memory to retrieve data. In some cases, the described predicted address/row generators and the associated techniques for predicting the next memory row access may be especially suitable for systems that execute artificial intelligence models, because AI models may be associated with repetitive memory access patterns that facilitate next-row prediction.
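
As a simple illustration of the pattern-learning approach described above, the following Python sketch observes the stream of accessed row addresses, detects a constant stride (e.g., sequential rows or every second row), and uses it to estimate the next row. The class and method names are illustrative and do not correspond to any element of the disclosed hardware.

class NextRowPredictor:
    def __init__(self, history_len=4):
        self.history = []              # most recently accessed row addresses
        self.history_len = history_len

    def observe(self, row_address):
        # Record a row address that was actually accessed.
        self.history.append(row_address)
        if len(self.history) > self.history_len:
            self.history.pop(0)

    def predict_next(self):
        # Return a predicted next row, or None if no stable pattern is observed.
        if len(self.history) < 2:
            return None
        strides = [b - a for a, b in zip(self.history, self.history[1:])]
        if all(s == strides[0] for s in strides):
            return self.history[-1] + strides[0]   # e.g. sequential (+1) or every second row (+2)
        return None

predictor = NextRowPredictor()
for row in (10, 12, 14, 16):                       # accesses to every second row
    predictor.observe(row)
print(predictor.predict_next())                    # -> 18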

FIG. 81A illustrates a system 8100, consistent with embodiments of the present disclosure, for activating a next row associated with memory banks 8180 based on next-row prediction. System 8100 may include a current and predicted address generator 8192, a bank controller 8191, and memory banks 8180A-8180B. The address generator may be the entity that generates the addresses used to access memory banks 8180A-8180B and may be based on any logic circuit, controller, or microprocessor executing a software program. Bank controller 8191 may be configured to access the current row of memory bank 8180A (e.g., using a current row identifier generated by address generator 8192). Bank controller 8191 may also be configured to activate, within memory bank 8180B, the predicted next row to be accessed based on a predicted row identifier generated by address generator 8192. The following examples describe two banks; in other examples, more banks may be used. In some embodiments, there may be memory banks that allow more than one row to be accessed at a time (as discussed below), and the same process may therefore be carried out within a single bank. As described above, activation of the predicted next row to be accessed may begin before the read operation performed on the currently accessed row has completed. Thus, in some cases, address generator 8192 may predict the next row to be accessed and may send the identifier of the predicted next row (e.g., one or more addresses) to bank controller 8191 at any time before the access to the current row has completed. This timing may allow the bank controller to initiate activation of the predicted next row at any point while the current row is being accessed and before the access to the current row is complete. In some cases, bank controller 8191 may initiate activation of the predicted next row of memory bank 8180 at the same time as (or within a few clock cycles of) the completion of the activation of the current row to be accessed and/or the start of the read operation on the current row.
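
The following behavioral Python sketch (not register-transfer logic, and not the disclosed implementation) illustrates the flow just described under the assumption of a simple two-bank split: the controller activates the row for the next access in the other bank before performing the read of the current row, so that the activation overlaps the read.

from collections import deque

class BankController:
    def __init__(self):
        self.open_row = {"A": None, "B": None}   # currently activated row per bank

    def activate(self, bank, row):
        # Stands in for the (slow) row activation step.
        self.open_row[bank] = row

    def read(self, bank, row):
        assert self.open_row[bank] == row, "a row must be activated before it is read"
        return f"data[{bank}:{row}]"

def run(access_sequence):
    controller = BankController()
    pending = deque(access_sequence)
    results = []
    while pending:
        bank, row = pending.popleft()
        if controller.open_row[bank] != row:
            controller.activate(bank, row)             # current row (if not already open)
        if pending:
            next_bank, next_row = pending[0]           # predicted next access
            if next_bank != bank:
                controller.activate(next_bank, next_row)   # overlapped with the read below
        results.append(controller.read(bank, row))
    return results

print(run([("A", 1), ("B", 2), ("A", 3), ("B", 4)]))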

In some embodiments, the operation performed on the current row associated with the current address may be a read or a write operation. In some embodiments, the current row and the next row may be in the same memory bank. In some embodiments, that same memory bank may allow the next row to be accessed while the current row is being accessed. The current row and the next row may also be in different memory banks. In some embodiments, the memory unit may include a processor configured to generate the current address and the predicted address. In some embodiments, the memory unit may include a distributed processor, which may include a plurality of processor subunits of a processing array spatially distributed among a plurality of discrete memory banks of a memory array. In some embodiments, the predicted address may be generated by a series of flip-flops that sample the address generated with a delay. The delay may be configurable via a multiplexer that selects among the flip-flops storing the sampled addresses.

It should be noted that after it is confirmed that the predicted next row is in fact the next row that the executing software requests to access (e.g., after the read operation on the current row has completed), the predicted next row may become the current row to be accessed. In the disclosed embodiments, because the process for activating the predicted next row may be initiated before the read operation on the current row has completed, the next row to be accessed may already be fully or partially activated by the time it is confirmed that the predicted next row is the correct next row to be accessed. This can significantly reduce the latency associated with row activation. If the next row is activated such that the activation finishes before or at the same time as the read of the current row finishes, a power reduction may also be obtained.

The current and predicted address generator 8192 may include any suitable logic components, computational units, memory units, algorithms, trained models, etc. configured to identify the row to be accessed in memory bank 8180 (e.g., based on program execution) and to predict the next row to be accessed (e.g., based on patterns observed in the row accesses, based on a predetermined pattern (n+1, n+2, etc.), and so on). For example, in some embodiments, the current and predicted address generator 8192 may include a counter 8192A, a current address generator 8192B, and a predicted address generator 8192C. Current address generator 8192B may be configured to generate the current address of the row to be accessed in memory bank 8180 based on the output of counter 8192A, e.g., in response to a request from a computational unit. The address associated with the current row to be accessed may be provided to bank controller 8191. Predicted address generator 8192C may be configured to determine the predicted address of the next row to be accessed in memory bank 8180 based on the output of counter 8192A, based on a predetermined access pattern (e.g., in combination with counter 8192A), or based on the output of a trained neural network or another type of pattern-prediction algorithm that observes row accesses and predicts the next row to be accessed based, for example, on patterns associated with the observed row accesses. Address generator 8192 may provide the predicted next-row address from predicted address generator 8192C to bank controller 8191.
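
A minimal sketch of this arrangement, assuming a fixed-stride access pattern and a configurable lookahead, is shown below; the class and attribute names stand in loosely for counter 8192A and generators 8192B/8192C and are otherwise assumptions.

class CurrentAndPredictedAddressGenerator:
    def __init__(self, base_row=0, stride=1, lookahead=1):
        self.counter = 0                 # plays the role of counter 8192A
        self.base_row = base_row
        self.stride = stride
        self.lookahead = lookahead       # how many accesses ahead the prediction runs

    def step(self):
        current = self.base_row + self.counter * self.stride                        # cf. 8192B
        predicted = self.base_row + (self.counter + self.lookahead) * self.stride   # cf. 8192C
        self.counter += 1
        return current, predicted

gen = CurrentAndPredictedAddressGenerator(stride=2)
for _ in range(3):
    print(gen.step())                    # (0, 2), (2, 4), (4, 6)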

In some embodiments, current address generator 8192B and predicted address generator 8192C may be implemented inside or outside of system 8100. An external host may also be implemented outside of system 8100 and further connected to system 8100. For example, current address generator 8192B may be software running on an external host that executes the program, and, to avoid any latency, predicted address generator 8192C may be implemented either inside system 8100 or outside system 8100.

As mentioned, the predicted next-row address may be determined using a trained neural network that predicts the next row to be accessed based on an input that may include one or more previously accessed row addresses. The trained neural network, or another type of model, may run within the logic associated with predicted address generator 8192C. In some cases, the trained neural network or the like may be executed by one or more computational units external to, but in communication with, predicted address generator 8192C.

In some embodiments, predicted address generator 8192C may include a replica or substantial replica of current address generator 8192B. Additionally, the timing of the operations of current address generator 8192B and predicted address generator 8192C may be fixed or adjustable relative to one another. For example, in some cases, predicted address generator 8192C may be configured to output the address identifier associated with the predicted next row at a fixed time (e.g., a fixed number of clock cycles) relative to when current address generator 8192B issues the address identifier associated with the row currently to be accessed. In some cases, the predicted next-row identifier may be generated before or after activation of the current row to be accessed begins, before or after the read operation associated with the current row begins, or at any time before the read operation associated with the current row completes. In some cases, the predicted next-row identifier may be generated at the same time as activation of the current row to be accessed begins, or at the same time as the read operation associated with the current row begins.

In other cases, the time between generation of the predicted next-row identifier and the activation of the current row to be accessed, or the start of the read operation associated with the current row, may be adjustable. For example, in some cases this time may be lengthened or shortened during operation of memory unit 8100 based on values associated with one or more operating parameters. In some cases, the current temperature (or any other parameter value) associated with the memory unit or with another component of the computing system may cause current address generator 8192B and predicted address generator 8192C to change their relative operating timing. In embodiments involving in-memory processing, the prediction mechanism may be part of that processing logic.

The current and predicted address generator 8192 may also generate a confidence level associated with the predicted next-row access determination. This confidence level (which may be determined by predicted address generator 8192C as part of the prediction process) may be used to determine, for example, whether to initiate activation of the predicted next row during the read operation on the current row (i.e., before the read operation on the current row has completed and before the identity of the next row to be accessed has been confirmed). For example, in some cases the confidence level associated with the predicted next row to be accessed may be compared to a threshold level. If the confidence level falls below the threshold level, memory unit 8100 may, for example, forgo activating the predicted next row. On the other hand, if the confidence level exceeds the threshold level, memory unit 8100 may initiate activation of the predicted next row in memory bank 8180.
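
Purely for illustration, such a threshold check might be expressed as in the sketch below; the threshold value and the function name are assumptions rather than values taken from the disclosure.

CONFIDENCE_THRESHOLD = 0.8   # assumed value

def should_activate_early(predicted_row, confidence, threshold=CONFIDENCE_THRESHOLD):
    # Activate the predicted row before the current read completes only when the
    # predictor's confidence is at or above the threshold.
    return predicted_row is not None and confidence >= threshold

print(should_activate_early(17, 0.95))   # True: start the activation speculatively
print(should_activate_early(17, 0.40))   # False: wait for the confirmed next address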

The mechanism for testing the confidence of the predicted next row against a threshold level, and for subsequently initiating or not initiating activation of the predicted next row, may be implemented in any suitable manner. In some cases, for example, if the confidence associated with the predicted next row falls below the threshold, predicted address generator 8192C may forgo outputting its predicted next-row result to downstream logic components. Alternatively, in this situation, the current and predicted address generator 8192 may withhold the predicted next-row identifier from bank controller 8191, or the bank controller (or another logic unit) may be equipped to use the confidence of the predicted next row to determine whether to begin activating the predicted next row before the read operation associated with the row currently being read has completed.

The confidence level associated with the predicted next row may be generated in any suitable manner. In some cases, such as when the predicted next row is identified based on a predetermined, known access pattern, predicted address generator 8192C may generate a high confidence level or, given the predetermined pattern of row accesses, may forgo generating a confidence level altogether. On the other hand, where predicted address generator 8192C executes one or more algorithms that monitor row accesses and output a predicted row based on patterns computed from the monitored accesses, or where one or more trained neural networks or other models are configured to output the predicted next row based on inputs that include recent row accesses, the confidence of the predicted next row may be determined based on any relevant parameters. For example, in some cases the confidence may depend on whether one or more previous next-row predictions proved accurate (e.g., a past-performance indicator). The confidence may also be based on one or more characteristics of the inputs to the algorithm/model. For example, inputs including actual row accesses that follow a pattern may result in a higher confidence than actual row accesses that exhibit less patterning, and in some cases where randomness is detected in the stream of inputs including recent row accesses, the generated confidence may be low. Additionally, where randomness is detected, the next-row prediction process may be halted entirely, one or more components of memory unit 8100 may ignore the next-row prediction, or any other action may be taken to forgo activating the predicted next row.

In some cases, a feedback mechanism may be included with respect to the operation of memory unit 8100. For example, periodically, or even after every next-row prediction, the accuracy with which predicted address generator 8192C predicted the actual next row to be accessed may be determined. In some cases, if an error occurs in predicting the next row to be accessed (or after a predetermined number of errors), the next-row prediction operation of predicted address generator 8192C may be temporarily suspended. In other cases, predicted address generator 8192C may include a learning element, such that one or more aspects of its prediction operation may be adjusted based on received feedback regarding its accuracy in predicting the next row to be accessed. This capability may improve the operation of predicted address generator 8192C so that it can adapt to changing access patterns and the like.
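
One simple way such feedback could be tracked, sketched below only as an illustration, is to keep a sliding window of prediction outcomes and disable speculative activation when recent mispredictions exceed a limit; the window size and miss limit are assumed values.

from collections import deque

class PredictionFeedback:
    def __init__(self, window=16, max_misses=4):
        self.outcomes = deque(maxlen=window)   # True = prediction matched the real access
        self.max_misses = max_misses

    def record(self, predicted_row, actual_row):
        self.outcomes.append(predicted_row == actual_row)

    def prediction_enabled(self):
        # Temporarily suspend speculative activation when recent misses pile up.
        misses = sum(1 for ok in self.outcomes if not ok)
        return misses <= self.max_misses

fb = PredictionFeedback()
for predicted, actual in [(2, 2), (4, 4), (6, 7), (8, 9), (10, 11), (12, 13), (14, 15)]:
    fb.record(predicted, actual)
print(fb.prediction_enabled())             # False: five recent mispredictions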

In some embodiments, the timing of the generation of the predicted next row and/or of the activation of the predicted next row may depend on the overall operation of memory unit 8100. For example, after power-up or after a reset of memory unit 8100, prediction of the next row to be accessed (or forwarding of the predicted next row to bank controller 8191) may be temporarily suspended (e.g., for a predetermined amount of time or number of clock cycles, until a predetermined number of row accesses/reads have completed, until the confidence of the predicted next row exceeds a predetermined threshold, or based on any other suitable criteria).

FIG. 81B illustrates another configuration of memory unit 8100 according to exemplary disclosed embodiments. In system 8100B of FIG. 81B, a cache 8193 may be associated with bank controller 8191. For example, cache 8193 may be configured to store one or more rows of data after those rows have been accessed, avoiding the need to activate those rows again. Cache 8193 may thus enable bank controller 8191 to access row data from cache 8193 rather than accessing memory bank 8180. For example, cache 8193 may store the last X rows of data (or follow any other cache retention policy), and bank controller 8191 may fill cache 8193 according to the predicted rows. Furthermore, if the predicted row is already in cache 8193, the predicted row does not need to be opened again, and the bank controller (or a cache controller implemented in cache 8193) may protect the predicted row from being evicted. Cache 8193 may provide several benefits. First, because rows are loaded into cache 8193 and the bank controller can access cache 8193 to retrieve row data, no special bank or additional bank is needed for next-row prediction. Second, reading from and writing to cache 8193 may save energy, because the physical distance from bank controller 8191 to cache 8193 is smaller than the physical distance from bank controller 8191 to memory bank 8180. Third, the latency introduced by cache 8193 is generally lower than that of memory bank 8180, because cache 8193 is smaller and closer to controller 8191. In some cases, when the predicted next row is activated in memory bank 8180 by bank controller 8191, the identifier of the predicted next row generated by the predicted address generator may, for example, be stored in cache 8193. Based on program execution or the like, current address generator 8192B may identify the actual next row to be accessed in memory bank 8180. The identifier associated with the actual next row to be accessed may be compared with the identifier of the predicted next row stored in cache 8193. If the actual next row to be accessed is the same as the predicted next row, bank controller 8191 may begin the read operation on that row once its activation has completed (the activation may already be fully or partially complete as a result of the next-row prediction process). On the other hand, if the actual next row to be accessed (as determined by current address generator 8192B) does not match the predicted next-row identifier stored in cache 8193, the read operation will not be started on the fully or partially activated predicted row; instead, the system will begin activating the actual next row to be accessed.
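
As an illustration of the caching behavior described above, the sketch below keeps recently read rows in a small buffer, pins the predicted next row so that it cannot be evicted, and falls back to opening the row in the bank only on a miss. The capacity and the LRU policy are illustrative assumptions.

from collections import OrderedDict

class RowCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = OrderedDict()      # row_address -> row data, in LRU order
        self.pinned = None              # predicted row protected from eviction

    def fill(self, row_address, data, pin=False):
        self.lines[row_address] = data
        self.lines.move_to_end(row_address)
        if pin:
            self.pinned = row_address
        while len(self.lines) > self.capacity:
            victim = next(addr for addr in self.lines if addr != self.pinned)
            del self.lines[victim]

    def read(self, row_address, open_row_from_bank):
        if row_address in self.lines:               # hit: no activation needed
            self.lines.move_to_end(row_address)
            return self.lines[row_address]
        data = open_row_from_bank(row_address)      # miss: open the row in the bank
        self.fill(row_address, data)
        return data

cache = RowCache()
cache.fill(7, "row7-data", pin=True)                # predicted next row, pre-filled and pinned
print(cache.read(7, open_row_from_bank=lambda r: f"row{r}-data"))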

Dual-Activation Banks

As discussed, it is valuable to describe several mechanisms that allow building banks capable of activating one row while another row is still being processed. Several embodiments may be provided for banks that activate an additional row while another row is being accessed. Although the embodiments describe the activation of only two rows, it should be understood that they may be extended to more rows. In the first proposed embodiment, a memory bank may be divided into memory sub-banks, and the described embodiments may be used to perform a read operation on a row in one sub-bank while activating the predicted or required next row in another sub-bank. For example, as shown in FIG. 81C, memory bank 8180 may be configured to include a plurality of memory sub-banks 8181. Additionally, the bank controller 8191 associated with memory bank 8180 may include a plurality of sub-bank controllers associated with corresponding sub-banks. A first sub-bank controller of the plurality of sub-bank controllers may be configured to enable access to data included in a current row of a first sub-bank of the plurality of sub-banks, while a second sub-bank controller activates a next row in a second sub-bank. When the words of only one sub-bank are accessed at a time, a single column decoder may be used. The two sub-banks may be tied to the same output bus so that they appear as a single bank. The inputs of this new single bank may also be a single address plus an additional row address used for opening the next row.

FIG. 81C illustrates first and second sub-bank row controllers (8183A, 8183B) for each memory sub-bank 8181. Memory bank 8180 may include a plurality of sub-banks 8181, as shown in FIG. 81C. Additionally, bank controller 8191 may include a plurality of sub-bank controllers 8183A-8183B, each associated with a corresponding sub-bank 8181. A first sub-bank controller 8183A of the plurality of sub-bank controllers may be configured to enable access to data included in a current row of a first portion of sub-bank 8181, while a second sub-bank controller 8183B activates a next row in a second portion of sub-bank 8181.

Because activating a row directly adjacent to the row being accessed may distort the accessed row and/or corrupt the data being read from it, the disclosed embodiments may, for example, be configured such that the predicted next row to be activated is separated by at least two rows from the current row in the first sub-bank whose data is being accessed. In some embodiments, the rows to be activated may be separated by at least one memory pad, so that the activations take place in different pads. The second sub-bank controller may be configured to enable access to data included in the current row of the second sub-bank while the first sub-bank controller activates the next row in the first sub-bank. The activated next row of the first sub-bank may be separated by at least two rows from the current row in the second sub-bank whose data is being accessed.

This predefined distance between the row being read/accessed and the row being activated may be determined, for example, by hardware that couples different portions of the memory bank to different row decoders, and software may maintain the predefined distance so as not to corrupt data. The separation between the rows may exceed two rows (e.g., it may be 3 rows, 4 rows, 5 rows, or even more than 5 rows). The distance may change over time, for example based on an evaluation of the distortion introduced into the stored data. The distortion may be evaluated in various ways, for example by calculating a signal-to-noise ratio, an error rate, the error-correction codes needed to repair the distortion, and the like. If the two rows are sufficiently far apart and the two bank controllers are implemented on the same bank, two rows can in fact be activated. This new architecture (implementing two controllers on the same bank) may prevent multiple rows in the same pad from being opened.

FIG. 81D illustrates an embodiment of next-row prediction consistent with embodiments of the present disclosure. Embodiments may include an additional pipeline of flip-flops (address registers A to C). The pipeline may be implemented with any number of flip-flops (stages) so as to start after the address generator and delay the overall execution by the delay required to use the delayed address. The prediction may then be the newly generated address (at the head of the pipeline, below address register C), while the current address is taken from the end of the pipeline. In this embodiment, the address generator does not need to be duplicated. A selector (the multiplexer shown in FIG. 81D) may be added to configure the delay, with the address registers providing the delay.
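
The pipeline of FIG. 81D can be illustrated with the following sketch, in which newly generated addresses enter a small shift register: the newest entry serves as the prediction and a selectable tap (the multiplexer) provides the delayed current address. The number of stages and the chosen tap are illustrative.

class AddressPipeline:
    def __init__(self, stages=3):
        self.registers = [None] * stages     # address registers A..C

    def clock(self, new_address):
        # Shift the pipeline by one stage and insert the newly generated address.
        self.registers = [new_address] + self.registers[:-1]

    def predicted_address(self):
        return self.registers[0]             # head of the pipeline (newest address)

    def current_address(self, tap):
        return self.registers[tap]           # multiplexer selecting the delay

pipe = AddressPipeline()
for addr in (100, 101, 102, 103):
    pipe.clock(addr)
print(pipe.predicted_address())              # 103: prediction, ahead of execution
print(pipe.current_address(tap=2))           # 101: delayed address used for the current access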

FIG. 81E illustrates an embodiment of a memory bank consistent with embodiments of the present disclosure. The memory bank may be implemented such that, if a newly activated row is far enough from the current row, activating the new row will not corrupt the current row. As shown in FIG. 81E, the memory bank may include an additional memory pad (shown in black) between every two rows of pads. A control unit (such as a row decoder) may therefore activate multiple rows that are separated by at least one pad.

In some embodiments, the memory unit may be configured to receive, at predetermined times, a first address for processing and a second address for activation and access.

FIG. 81F illustrates another embodiment of a memory bank consistent with embodiments of the present disclosure. The memory bank may be implemented such that, if a newly activated row is far enough from the current row, activating the new row will not corrupt the current row. The embodiment depicted in FIG. 81F may allow the row decoder to open rows n and n+1 by ensuring that all even rows are implemented in the upper half of the memory bank and all odd rows are implemented in the lower half. This implementation may allow access to consecutive rows that are always sufficiently far apart.
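
The even/odd placement can be summarized by a simple mapping from logical row numbers to physical halves, sketched below for illustration (the function names are assumptions):

def physical_location(logical_row):
    # Even rows sit in the upper half of the bank, odd rows in the lower half.
    half = "upper" if logical_row % 2 == 0 else "lower"
    return half, logical_row // 2

def can_open_together(row_a, row_b):
    return physical_location(row_a)[0] != physical_location(row_b)[0]

print(physical_location(6))        # ('upper', 3)
print(physical_location(7))        # ('lower', 3)
print(can_open_together(6, 7))     # True: consecutive rows land in different halves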

According to the disclosed embodiments, a dual-control memory bank may allow different portions of a single memory bank to be accessed and activated, even when the dual-control memory bank is configured to output one data unit at a time. For example, as described, dual control may enable the memory bank to access a first row while a second row (e.g., the predicted next row or a predetermined next row to be accessed) is being activated.

FIG. 82 illustrates a dual-control memory bank 8280 for reducing memory row activation penalties (e.g., latency) consistent with embodiments of the present disclosure. Dual-control memory bank 8280 may include inputs including a data input (DIN) 8290, a row address (ROW) 8291, a column address (COLUMN) 8292, a first command input (COMMAND_1) 8293, and a second command input (COMMAND_2) 8294. Memory bank 8280 may also include a data output (Dout) 8295.

It is assumed that an address may include a row address and a column address, and that there are two row decoders. Other address configurations may be provided, the number of row decoders may exceed two, and there may be more than a single column decoder.

The row address (ROW) 8291 may identify the row associated with a command such as an activation command. Because an activation of a row may be followed by reads from or writes to that row, once the row is open (after its activation) there may be no need to send a row address for writing to or reading from the open row.

The first command input (COMMAND_1) 8293 may be used to send a command (such as, but not limited to, an activation command) to a row accessed by the first row decoder. The second command input (COMMAND_2) 8294 may be used to send a command (such as, but not limited to, an activation command) to a row accessed by the second row decoder.

The data input (DIN) 8290 may be used to feed in data when a write operation is performed.

Because an entire row cannot be read at once, individual row segments may be read sequentially, and the column address (COLUMN) 8292 may indicate which segment (which columns) of the row is to be read. For simplicity of explanation, it may be assumed that there are 2^Q segments and that the column input has Q bits, where Q is a positive integer greater than one.

Dual-control memory bank 8280 may operate with or without the address prediction described above with respect to FIGS. 81A-81B. Of course, to reduce operating latency, the dual-control memory bank may operate with address prediction in accordance with the disclosed embodiments.

FIGS. 83A, 83B, and 83C illustrate examples of accessing and activating rows of memory bank 8180. As mentioned above, assume in one example that both reading a row and activating a row require 32 cycles (segments). Additionally, to reduce the activation penalty (whose length is denoted Delta), it may be beneficial to know in advance (at least Delta before the next row needs to be accessed) that the next row should be opened. In some cases, Delta may equal four cycles. Each of the memory banks depicted in FIGS. 83A, 83B, and 83C may include two or more sub-banks, within which, in some embodiments, only one row may be open at any given time. In some cases, even rows may be associated with the first sub-bank and odd rows with the second sub-bank. In this example, use of the disclosed predictive addressing embodiments may make it possible to initiate activation of a row of one memory sub-bank before the end of the read operation on a row of the other memory sub-bank is reached (a Delta period before the end is reached). In this way, sequential memory accesses can be performed efficiently (e.g., a predefined memory access sequence in which rows 1, 2, 3, 4, 5, 6, 7, 8, ... are to be read, with rows 1, 3, 5, ... associated with the first memory sub-bank and rows 2, 4, 6, ... associated with the second, different memory sub-bank).

FIG. 83A illustrates a state for accessing memory rows included in two different memory sub-banks. In the state shown in FIG. 83A:

a. Row A may be accessible by the first row decoder. The first segment (the leftmost segment, marked in gray) may be accessed after the first row decoder has activated row A.

b. Row B may be accessible by the second row decoder. In the state shown in FIG. 83A, row B is closed and has not yet been activated.

The state illustrated in FIG. 83A may be preceded by sending an activation command and the address of row A to the first row decoder.

FIG. 83B illustrates a state for accessing row B after row A has been accessed. According to this example, row A may be accessible by the first row decoder. In the state shown in FIG. 83B, the first row decoder has activated row A, and all segments except the four rightmost segments (the four segments not marked in gray) have been accessed. Because Delta (the four white segments in row A) equals four cycles, the bank controller may enable the second row decoder to activate row B before the rightmost segment of row A is accessed. In some cases, the activation of row B may be responsive to a predetermined access pattern (e.g., sequential row access in which odd rows are assigned to the first sub-bank and even rows to the second sub-bank). In other cases, the activation of row B may be responsive to any of the row prediction techniques described above. The bank controller may enable the second row decoder to activate row B in advance, so that when row B is accessed it has already been activated (opened) rather than having to wait for its activation before it is opened.

The state illustrated in FIG. 83B may be preceded by the following operations:

a. Sending an activation command and the address of row A to the first row decoder.

b. Writing or reading the first twenty-eight segments of row A.

c. After the read or write operations on the twenty-eight segments of the row, sending an activation command with the address of row B to the second row decoder.

In some embodiments, even-numbered lines are located in one half of the one or more memory banks. In some embodiments, odd-numbered lines are located in the other half of the one or more memory banks.

In some embodiments, an additional row of redundant pads is placed between each two rows of pads to establish the distance required to allow activation. In some embodiments, multiple rows that are close to one another may not be activated at the same time.

FIG. 83C illustrates a state for accessing row C (e.g., the next odd row included in the first sub-bank) after row A has been accessed. As shown in FIG. 83C, row B may be accessible by the second row decoder. As shown, the second row decoder has activated row B, and all segments except the four rightmost segments (the four remaining segments not marked in gray) have been accessed. Because Delta equals four cycles in this example, the bank controller may enable the first row decoder to activate row C before the rightmost segment of row B is accessed. The bank controller may enable the first row decoder to activate row C in advance, so that when row C is accessed it has already been activated rather than having to wait for its activation. Operating in this manner may reduce or completely eliminate the latency associated with memory read operations.
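
Purely for illustration, the alternating schedule of FIGS. 83A-83C can be simulated cycle by cycle as sketched below, with 32 segments per row and the next row's activation issued Delta = 4 cycles before the current row finishes; the code is an assumption-based simulation rather than the disclosed hardware.

SEGMENTS_PER_ROW = 32
DELTA = 4                                   # cycles needed to activate a row

def schedule(rows):
    events = []
    cycle = 0
    # The first row must be activated before any read can start.
    events.append((cycle, f"activate {rows[0]}"))
    cycle += DELTA
    for i, row in enumerate(rows):
        for segment in range(SEGMENTS_PER_ROW):
            if i + 1 < len(rows) and segment == SEGMENTS_PER_ROW - DELTA:
                # Issue the activation of the next row (in the other sub-bank)
                # DELTA cycles before the current row's read finishes.
                events.append((cycle, f"activate {rows[i + 1]} (other sub-bank)"))
            events.append((cycle, f"read {row} segment {segment}"))
            cycle += 1
    return events

for cycle, event in schedule(["row A", "row B", "row C"])[:6]:
    print(cycle, event)                      # first few events of the overlapped schedule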

Memory Pad as a Register File

In computer architectures, processor registers are storage locations that a computer processor (e.g., a central processing unit (CPU)) can access quickly. Registers typically constitute the memory closest to the processor core (L0). Registers can provide the fastest way to access certain types of data. A computer may have several types of registers, each classified according to the type of information it stores or the type of instructions that operate on the information in that type of register. For example, a computer may include: data registers, which hold numerical information, operands, intermediate results, and configuration; address registers, which store address information used by instructions to access main memory; general-purpose registers, which store both data and address information; status registers; and other registers. A register file is a logical grouping of registers available to a computer processing unit.

In many cases, a computer's register file is located within the processing unit (e.g., the CPU) and is implemented with logic transistors. In the disclosed embodiments, however, the computational processing units may not reside in a conventional CPU. Rather, such processing elements (e.g., processor subunits) may be spatially distributed as a processing array within a memory chip (as described in the sections above). Each processor subunit may be associated with one or more corresponding, dedicated memory units (e.g., memory banks). With this architecture, each processor subunit may be spatially located near the one or more memory elements that store the data on which that particular processor subunit is to operate. As described herein, this architecture can significantly accelerate certain memory-intensive operations by, for example, eliminating the memory-access bottlenecks experienced by typical CPU and external memory architectures.

Nevertheless, the distributed processor memory chip architecture described herein may still make use of register files, which include various types of registers for operating on data from the memory elements dedicated to the corresponding processor subunits. Because the processor subunits may be distributed among the memory elements of the memory chip, it is possible to add, to a corresponding processor subunit, one or more memory elements (which, compared with logic components in a particular fabrication process, may benefit from that same process) to serve as a register file or cache for that processor subunit, rather than as main memory storage.

This architecture may provide several advantages. For example, because the register file is part of the corresponding processor subunit, the processor subunit can be located spatially close to the relevant register file. This arrangement can significantly increase operating efficiency. Conventional register files are implemented with logic transistors. For example, each bit of a conventional register file is built from about 12 logic transistors, so a 16-bit register file is built from 192 logic transistors. Such a register file may require a substantial amount of logic to access the logic transistors and may therefore occupy a large area. Compared with a register file implemented with logic transistors, the register file of the disclosed embodiments may require significantly less area. This size reduction may be achieved by implementing the register file of the disclosed embodiments using memory pads that include memory cells fabricated with a process optimized for manufacturing memory structures rather than logic structures. The size reduction may also allow for a larger register file or cache.

In some embodiments, a distributed processor memory chip may be provided. The distributed processor memory chip may include: a substrate; a memory array disposed on the substrate and including a plurality of discrete memory banks; and a processing array disposed on the substrate and including a plurality of processor subunits. Each of the processor subunits may be associated with a corresponding, dedicated memory bank among the plurality of discrete memory banks. The distributed processor memory chip may further include a first plurality of buses and a second plurality of buses. Each bus of the first plurality of buses may connect one of the plurality of processor subunits to its corresponding dedicated memory bank. Each bus of the second plurality of buses may connect one of the plurality of processor subunits to another of the plurality of processor subunits. In some cases, the second plurality of buses may connect one or more of the plurality of processor subunits to two or more other processor subunits among the plurality of processor subunits. One or more of the processor subunits may further include at least one memory pad disposed on the substrate. The at least one memory pad may be configured to serve as at least one register of a register file for one or more of the plurality of processing subunits.
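
The relationships described above can be illustrated with a simple data-structure sketch: each processor subunit is paired with a dedicated memory bank (reached over a bus of the first plurality), linked to neighboring subunits (over buses of the second plurality), and given one memory pad that serves as its register file. All names in the sketch are illustrative.

from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    bank_id: int
    pads: list

@dataclass
class ProcessorSubunit:
    subunit_id: int
    dedicated_bank: MemoryBank                       # reached over a bus of the first plurality
    register_file_pad: str                           # memory pad serving as the register file
    neighbors: list = field(default_factory=list)    # reached over buses of the second plurality

banks = [MemoryBank(i, pads=[f"pad_{i}_{j}" for j in range(3)]) for i in range(3)]
subunits = [ProcessorSubunit(i, banks[i], register_file_pad=f"rf_pad_{i}") for i in range(3)]
for a, b in [(0, 1), (0, 2)]:                        # one subunit linked to two others
    subunits[a].neighbors.append(subunits[b])
print(len(subunits[0].neighbors))                    # 2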

In some cases, the register file may be associated with one or more logic components that enable the memory pad to serve as one or more registers of the register file. For example, such logic components may include switches, amplifiers, inverters, sense amplifiers, and others. In examples where the register file is implemented with a dynamic random access memory (DRAM) pad, logic components may be included to perform refresh operations so that the stored data is not lost. Such logic components may include row and column multiplexers ("mux"). Additionally, a register file implemented with DRAM pads may include a redundancy mechanism to counteract yield loss.

FIG. 84 illustrates a conventional computer architecture 8400 that includes a CPU 8402 and an external memory 8406. During operation, values from memory 8406 may be loaded into registers associated with a register file 8504 included in CPU 8402.

FIG. 85A illustrates an exemplary distributed processor memory chip 8500a consistent with the disclosed embodiments. In contrast to the architecture of FIG. 84, distributed processor memory chip 8500a includes memory components and processor components disposed on the same substrate. That is, chip 8500a may include a memory array and a processing array, the processing array including a plurality of processor subunits each associated with one or more dedicated memory banks included in the memory array. In the architecture of FIG. 85A, the registers used by the processor subunits are provided by one or more memory pads disposed on the same substrate on which the memory array and the processing array are formed.

As depicted in FIG. 85A, distributed processor memory chip 8500a may be formed from a plurality of processing groups 8510a, 8510b, and 8510c disposed on a substrate 8502. More specifically, distributed processor memory chip 8500a may include a memory array 8520 and a processing array 8530 disposed on substrate 8502. Memory array 8520 may include a plurality of memory banks, such as memory banks 8520a, 8520b, and 8520c. Processing array 8530 may include a plurality of processor subunits, such as processor subunits 8530a, 8530b, and 8530c.

Additionally, each of processing groups 8510a, 8510b, and 8510c may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. In the embodiment depicted in FIG. 85A, each of processor subunits 8530a, 8530b, and 8530c may be associated with a corresponding dedicated memory bank 8520a, 8520b, or 8520c. That is, processor subunit 8530a may be associated with memory bank 8520a; processor subunit 8530b may be associated with memory bank 8520b; and processor subunit 8530c may be associated with memory bank 8520c.

To allow each processor subunit to communicate with its corresponding dedicated memory bank, distributed processor memory chip 8500a may include a first plurality of buses 8540a, 8540b, and 8540c, each connecting one of the processor subunits to its corresponding dedicated memory bank. In the embodiment depicted in FIG. 85A, bus 8540a may connect processor subunit 8530a to memory bank 8520a; bus 8540b may connect processor subunit 8530b to memory bank 8520b; and bus 8540c may connect processor subunit 8530c to memory bank 8520c.

Furthermore, to allow each processor subunit to communicate with other processor subunits, distributed processor memory chip 8500a may include a second plurality of buses 8550a and 8550b connecting one of the processor subunits to at least one other processor subunit. In the embodiment depicted in FIG. 85A, bus 8550a may connect processor subunit 8530a to processor subunit 8530b, bus 8550b may connect processor subunit 8530a to processor subunit 8530c, and so on.

Each of discrete memory banks 8520a, 8520b, and 8520c may include a plurality of memory pads. In the embodiment depicted in FIG. 85A, memory bank 8520a may include memory pads 8522a, 8524a, and 8526a; memory bank 8520b may include memory pads 8522b, 8524b, and 8526b; and memory bank 8520c may include memory pads 8522c, 8524c, and 8526c. As previously disclosed with respect to FIG. 10, a memory pad may include a plurality of memory cells, and each cell may include a capacitor, a transistor, or other circuitry that stores at least one bit of data. A conventional memory pad may contain, for example, 512 bits by 512 bits, although the embodiments disclosed herein are not limited thereto.

At least one of processor subunits 8530a, 8530b, and 8530c may include at least one memory pad, such as memory pads 8532a, 8532b, and 8532c, configured to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c. That is, at least one of memory pads 8532a, 8532b, and 8532c provides at least one register of a register file used by one or more of processor subunits 8530a, 8530b, and 8530c. A register file may include one or more registers. In the embodiment depicted in FIG. 85A, memory pad 8532a in processor subunit 8530a may serve as a register file for processor subunit 8530a (and/or for any other processor subunit included in distributed processor memory chip 8500a) (also referred to as "register file 8532a"); memory pad 8532b in processor subunit 8530b may serve as a register file for processor subunit 8530b; and memory pad 8532c in processor subunit 8530c may serve as a register file for processor subunit 8530c.

At least one of processor subunits 8530a, 8530b, and 8530c may further include at least one logic component, such as logic components 8534a, 8534b, and 8534c. Each logic component 8534a, 8534b, or 8534c may be configured to enable the corresponding memory pad 8532a, 8532b, or 8532c to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c.

In some embodiments, the at least one memory pad may be disposed on the substrate, and the at least one memory pad may contain at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits. In some embodiments, at least one of the processor subunits may include a mechanism to stop the current task and, at certain times, trigger a memory refresh operation to refresh the memory pad.
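
As an illustration of the refresh mechanism mentioned above, the sketch below pauses the current task at regular intervals and refreshes the DRAM-pad-based register file so that the stored register values are not lost; the refresh interval and row count are assumed values.

class DramRegisterFile:
    def __init__(self, rows=16, refresh_interval=64):
        self.rows = rows
        self.refresh_interval = refresh_interval   # work cycles allowed between refreshes
        self.cycles_since_refresh = 0
        self.refresh_count = 0

    def tick(self, do_work):
        # At certain times the subunit stops its current task and refreshes the pad.
        if self.cycles_since_refresh >= self.refresh_interval:
            self.refresh()
        else:
            do_work()
            self.cycles_since_refresh += 1

    def refresh(self):
        self.refresh_count += 1          # stands in for restoring every row's cells
        self.cycles_since_refresh = 0

rf = DramRegisterFile()
for _ in range(200):
    rf.tick(do_work=lambda: None)
print(rf.refresh_count)                  # a few refresh bursts interleaved with the work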

FIG. 85B illustrates an exemplary distributed processor memory chip 8500b consistent with the disclosed embodiments. Memory chip 8500b illustrated in FIG. 85B is substantially the same as memory chip 8500a illustrated in FIG. 85A, except that memory pads 8532a, 8532b, and 8532c in FIG. 85B are not included within the corresponding processor subunits 8530a, 8530b, and 8530c. Instead, memory pads 8532a, 8532b, and 8532c in FIG. 85B are disposed outside of, but spatially close to, the corresponding processor subunits 8530a, 8530b, and 8530c. In this way, memory pads 8532a, 8532b, and 8532c may still serve as register files for the corresponding processor subunits 8530a, 8530b, and 8530c.

图85C说明符合所公开实施例的装置8500c。装置8500c包括基板8560、第一存储器组8570、第二存储器组8572及处理单元8580。第一存储器组8570、第二存储器组8572及处理单元8580安置于基板8560上。处理单元8580包括处理器8584及由存储器垫实施的寄存器文件8582。在处理单元8580的操作期间,处理器8584可存取寄存器文件8582以读取或写入数据。Figure 85C illustrates a device 8500c consistent with the disclosed embodiments. Device 8500c includes substrate 8560 , first memory bank 8570 , second memory bank 8572 , and processing unit 8580 . The first memory group 8570 , the second memory group 8572 and the processing unit 8580 are disposed on the substrate 8560 . The processing unit 8580 includes a processor 8584 and a register file 8582 implemented by a memory pad. During operation of processing unit 8580, processor 8584 may access register file 8582 to read or write data.

分布式处理器存储器芯片8500a、8500b或装置8500c可基于处理器子单元对由存储器垫提供的寄存器的存取而提供多种功能。举例而言,在一些实施例中,分布式处理器存储器芯片8500a或8500b可包括处理器子单元,该处理器子单元充当耦接至存储器的加速器,从而允许其使用更多存储器带宽。在图85A中所描绘的实施例中,处理器子单元8530a可充当加速器(亦被称作“加速器8530a”)。加速器8530a可使用安置于加速器8530a中的存储器垫8532a以提供寄存器文件的一个或多个寄存器。替代地,在图85B中所描绘的实施例中,加速器8530a可使用安置于加速器8530a外部的存储器垫8532a作为寄存器文件。又另外,加速器8530a可使用存储器组8520b中的存储器垫8522b、8524b及8526b中的任一者或存储器组8520c中的存储器垫8522c、8524c及8526c中的任一者,以提供一个或多个寄存器。Distributed processor memory chips 8500a, 8500b or device 8500c may provide a variety of functions based on processor subunit accesses to registers provided by memory pads. For example, in some embodiments, a distributed processor memory chip 8500a or 8500b may include a processor subunit that acts as an accelerator coupled to memory, allowing it to use more memory bandwidth. In the embodiment depicted in Figure 85A, the processor subunit 8530a may function as an accelerator (also referred to as "accelerator 8530a"). The accelerator 8530a may use a memory pad 8532a disposed in the accelerator 8530a to provide one or more registers of a register file. Alternatively, in the embodiment depicted in Figure 85B, the accelerator 8530a may use a memory pad 8532a disposed outside the accelerator 8530a as a register file. Still further, accelerator 8530a may use any of memory pads 8522b, 8524b, and 8526b in memory bank 8520b or any of memory pads 8522c, 8524c, and 8526c in memory bank 8520c to provide one or more registers .

所公开实施例可尤其适用于某些类型的图像处理、神经网络、数据库分析、压缩及解压缩以及更多应用。举例而言,在图85A或图85B的实施例中,存储器垫可提供用于与存储器垫包括在同一芯片上的一个或多个处理器子单元的寄存器文件的一个或多个寄存器。一个或多个寄存器可用以储存由处理器子单元频繁存取的数据。举例而言,在卷积图像处理期间,卷积加速器可在保存于存储器中的整个图像上反复使用相同系数。用于此卷积加速器的所建议实施方案可将所有这些系数保存于在一个或多个寄存器内的“关闭”寄存器文件中,该一个或多个寄存器包括于专用于一个或多个处理器子单元的存储器垫内,该一个或多个处理器子单元与寄存器文件存储器垫位于同一芯片上。此架构可将寄存器(及所储存的系数值)置放成紧密接近对系数值操作的处理器子单元。因为由存储器垫实施的寄存器文件可充当在空间上紧密的高效高速缓存,所以可达成数据传送的显著较低损失及存取的较低潜时。The disclosed embodiments may be particularly suitable for certain types of image processing, neural networks, database analysis, compression and decompression, and many more applications. For example, in the embodiment of Figure 85A or Figure 85B, the memory pad may provide one or more registers for the register file of one or more processor subunits included on the same chip as the memory pad. One or more registers may be used to store data frequently accessed by processor subunits. For example, during convolutional image processing, the convolution accelerator may reuse the same coefficients over the entire image held in memory. The proposed implementation for this convolution accelerator may store all of these coefficients in a "closed" register file within one or more registers included in the registers dedicated to one or more processor subsystems. Within the unit's memory pad, the one or more processor subunits are located on the same chip as the register file memory pad. This architecture can place registers (and stored coefficient values) in close proximity to processor subunits that operate on the coefficient values. Because the register file implemented by the memory pad can act as a spatially compact, efficient cache, significantly lower penalties for data transfers and lower latencies for accesses can be achieved.

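As a purely illustrative sketch of the data-movement pattern just described (the function names and the software modeling are assumptions, not the disclosed hardware), the convolution coefficients can be loaded once into a small local buffer that stands in for the memory-pad register file, while the image itself is streamed from the larger memory array:

```python
def convolve_with_local_coefficients(image, kernel):
    """Sketch of coefficient reuse: the kernel is copied once into a small
    local buffer (standing in for the memory-pad register file) and reused
    for every output pixel, while image data is streamed from main memory."""
    local_regs = [row[:] for row in kernel]     # one-time load into the "register file"
    kh, kw = len(local_regs), len(local_regs[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    # image element streamed from the memory array,
                    # coefficient read from the nearby register file
                    acc += image[i + di][j + dj] * local_regs[di][dj]
            out[i][j] = acc
    return out

image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
print(convolve_with_local_coefficients(image, kernel))  # [[5.0, 6.0], [9.0, 10.0]]
```

The only point of the sketch is that every multiply reads its coefficient from the small nearby buffer, so the large memory array is accessed only for image data.
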
In another example, the disclosed embodiments may include an accelerator that can load words into the registers provided by a memory pad. The accelerator may treat the registers as a circular buffer in order to multiply vectors in a single loop. For example, in the device 8500c illustrated in Figure 85C, the processor 8584 in processing unit 8580 acts as an accelerator that uses the register file 8582, implemented by a memory pad, as a circular buffer to store data A1, A2, A3, .... The first memory bank 8570 stores data B1, B2, B3, ... to be multiplied by data A1, A2, A3, .... The second memory bank 8572 stores the multiplication results C1, C2, C3, ..., that is, Ci = Ai × Bi. If no register file were present in processing unit 8580, processor 8584 would require more memory bandwidth and more cycles in order to read both data A1, A2, A3, ... and data B1, B2, B3, ... from external memory banks such as memory banks 8570 or 8572, which can introduce significant delays. In the present embodiment, on the other hand, data A1, A2, A3, ... are stored in the register file 8582 formed within processing unit 8580. Therefore, processor 8584 only needs to read data B1, B2, B3, ... from the external memory bank 8570, and the required memory bandwidth can be significantly reduced.

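A minimal software sketch of the Ci = Ai × Bi access pattern described above (hypothetical names; the circular buffer stands in for register file 8582) is:

```python
def vector_multiply_with_circular_buffer(a_values, b_bank):
    """Ci = Ai * Bi, with the A operands held in a small circular buffer
    (modeling the memory-pad register file) so only Bi is fetched from the
    external memory bank on each step."""
    ring = list(a_values)           # one-time load of A1, A2, ... into the "register file"
    n = len(ring)
    results = []
    for i, b in enumerate(b_bank):  # Bi streamed from the first memory bank
        a = ring[i % n]             # circular-buffer access, no external read needed
        results.append(a * b)       # Ci conceptually written to the second memory bank
    return results

print(vector_multiply_with_circular_buffer([1, 2, 3, 4], [10, 20, 30, 40, 50, 60, 70, 80]))
# [10, 40, 90, 160, 50, 120, 210, 320]
```
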
In memory processes, memory pads typically allow unidirectional access (that is, a single access). In unidirectional access there is one port to the memory. As a result, only one access operation, for example a read or a write, to a particular address can be performed at a time. However, bidirectional access may be a valid option if the memory pad itself allows bidirectional access. In bidirectional access, two different addresses can be accessed at a time. The method of accessing the memory pad may be determined based on area and requirements. In some cases, register files implemented by memory pads may allow four-way access if they are connected to a processor that needs to read two sources and has one destination register. In some cases, when a register file is implemented by DRAM pads to store configuration or cache data, the register file may allow only unidirectional access. A standard CPU may include multi-directional access pads, while unidirectional access pads may be better suited to DRAM applications.

When a controller or accelerator is designed in such a way that it needs only single access to its registers (in the few cases where this is possible), registers implemented by memory pads may be used instead of a traditional register file. In a single access, only one word can be accessed at a time. For example, a processing unit may access two words at a time from two register files, where each of the two register files is implemented by a memory pad (for example, a DRAM pad) that allows only a single access.

In most technologies, a memory pad IP (a closed block (IP) obtained from a manufacturer) comes with the wiring for row and column access, such as word lines and row lines, already in place. The memory pad IP does not, however, include the surrounding logic components. Therefore, a register file implemented by a memory pad as disclosed in embodiments of the present invention may include logic components. The size of the memory pad may be selected based on the desired size of the register file.

Certain challenges may arise when using memory pads to provide the registers of a register file, and these challenges may depend on the particular memory technology used to form the memory pads. For example, in memory production, not all fabricated memory cells operate properly after production. This is a known problem, especially where there is a high density of SRAM or DRAM on the chip. To address this problem in memory technology, one or more redundancy mechanisms may be used in order to maintain yield at a reasonable level. In the disclosed embodiments, because the number of memory instances (for example, memory banks) used to provide the registers of a register file can be rather small, redundancy mechanisms may be less important than in normal memory applications. On the other hand, the same production issues that affect memory functionality can also affect whether a particular memory pad functions properly when providing one or more registers. As a result, redundant elements may be included in the disclosed embodiments. For example, at least one redundant memory pad may be disposed on the substrate of the distributed processor memory chip. The at least one redundant memory pad may be configured to provide at least one redundant register for one or more of the plurality of processor subunits. In another example, the pad may be larger than the required size (for example, 620×620 instead of 512×512), and the redundancy mechanism may be built into the region of the memory pad outside the 512×512 region or its equivalent.

Another challenge may relate to timing. The timing for loading word and bit lines is usually determined by the size of the memory. Since the register file can be implemented by a rather small single memory pad (for example, 512×512 bits), the time required to load a word from the memory pad will be small, and the timing may be sufficient to run fairly quickly compared to the logic.

Refresh - some memory types, such as DRAM, need to be refreshed periodically. The refresh may be performed while the processor or accelerator is paused. For a small memory pad, the refresh time may be a small fraction of the time. Therefore, even if the system stops for a short period, the gain obtained by using the memory pad as registers is worth the downtime in terms of overall performance. In one embodiment, the processing unit may include a counter that counts down from a predefined number. When the counter reaches "0", the processing unit may stop the current task being executed by the processor (for example, the accelerator) and trigger a refresh operation that refreshes the memory pad row by row. When the refresh operation completes, the processor may resume its task, and the counter may be reset to count down from the predefined number.

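A minimal behavioral sketch of the counter-driven refresh scheme described above (class and method names are hypothetical) could look like this:

```python
class RefreshingRegisterFile:
    """Behavioral model of the counter-driven refresh: work proceeds until the
    countdown expires, then the pad is refreshed row by row and the counter is
    reset before the task resumes."""

    def __init__(self, num_rows, refresh_interval):
        self.num_rows = num_rows
        self.refresh_interval = refresh_interval
        self.counter = refresh_interval
        self.refreshes_done = 0

    def tick(self, do_work):
        """Run one unit of work, refreshing first if the counter has expired."""
        if self.counter == 0:
            for row in range(self.num_rows):      # pause the task, refresh row by row
                self._refresh_row(row)
            self.refreshes_done += 1
            self.counter = self.refresh_interval  # reset, then resume the task
        do_work()
        self.counter -= 1

    def _refresh_row(self, row):
        pass  # hardware would recharge/rewrite the row here


rf = RefreshingRegisterFile(num_rows=512, refresh_interval=1000)
for _ in range(5000):
    rf.tick(lambda: None)
print(rf.refreshes_done)  # 4 refresh pauses interleaved with 5000 work ticks
```
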
Figure 86 provides a flowchart 8600 representing an exemplary method for executing at least one instruction in a distributed processor memory chip consistent with the disclosed embodiments. For example, at step 8602, at least one data value may be retrieved from a memory array on a substrate of the distributed processor memory chip. At step 8604, the retrieved data value may be stored in a register provided by a memory pad of the memory array on the substrate of the distributed processor memory chip. At step 8606, a processor element, such as one or more of the distributed processor subunits onboard the distributed processor memory chip, may operate on the data value stored in the memory pad register.

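A trivial software analogue of flowchart 8600 (names are hypothetical) is shown below, only to make the three steps concrete:

```python
class DistributedProcessorMemoryChipModel:
    """Toy model of flowchart 8600: fetch from the memory array (8602), stage
    the value in a memory-pad register (8604), then operate on it (8606)."""

    def __init__(self, memory_array):
        self.memory_array = memory_array   # stands in for the on-substrate memory array
        self.pad_register = None           # stands in for a memory-pad register

    def step_8602_retrieve(self, address):
        return self.memory_array[address]

    def step_8604_store_in_register(self, value):
        self.pad_register = value

    def step_8606_operate(self):
        return self.pad_register * 2       # arbitrary illustrative operation


chip = DistributedProcessorMemoryChipModel([10, 20, 30])
chip.step_8604_store_in_register(chip.step_8602_retrieve(1))
print(chip.step_8606_operate())  # 40
```
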
Here and throughout, it should be understood that all references to a register file may equally refer to a cache, since a register file may serve as the lowest-level cache.

Processing Bottlenecks

The terms "first", "second", "third", and the like are used only to distinguish between different terms. These terms do not necessarily imply an order and/or timing and/or importance of the components. For example, a first process may be preceded by a second process, and the like.

The term "coupled" may mean directly connected and/or indirectly connected.

The terms "memory/processing", "memory and processing", and "memory processing" are used interchangeably.

Various methods, computer-readable media, memory/processing units, and/or systems that may be or may include memory/processing units may be provided.

A memory/processing unit is a hardware unit that has both memory and processing capabilities.

A memory/processing unit may be a memory processing integrated circuit, may be included in a memory processing integrated circuit, or may include one or more memory processing integrated circuits.

A memory/processing unit may be a distributed processor as described in PCT patent application publication WO2019025892.

A memory/processing unit may include a distributed processor as described in PCT patent application publication WO2019025892.

A memory/processing unit may belong to a distributed processor as described in PCT patent application publication WO2019025892.

A memory/processing unit may be a memory chip as described in PCT patent application publication WO2019025892.

A memory/processing unit may include a memory chip as described in PCT patent application publication WO2019025892.

A memory/processing unit may be a distributed processor as described in PCT patent application No. PCT/IB2019/001005.

A memory/processing unit may belong to a distributed processor as described in PCT patent application No. PCT/IB2019/001005.

A memory/processing unit may be a memory chip as described in PCT patent application No. PCT/IB2019/001005.

A memory/processing unit may include a memory chip as described in PCT patent application No. PCT/IB2019/001005.

A memory/processing unit may belong to a memory chip as described in PCT patent application No. PCT/IB2019/001005.

A memory/processing unit may be implemented as integrated circuits connected to each other using wafer-to-wafer bonding and a plurality of conductors.

Any reference to a distributed processor memory chip, a distributed memory processing integrated circuit, a memory chip, or a distributed processor may be implemented as a pair of integrated circuits connected to each other by wafer-to-wafer bonding and a plurality of conductors.

A memory/processing unit may be manufactured by a first manufacturing process that is better suited to memory cells than to logic cells. Accordingly, the first manufacturing process may be regarded as a memory-class manufacturing process. A memory cell may include one or more transistors. A logic cell may include one or more transistors. The first manufacturing process may be used to manufacture memory banks. A logic cell may include one or more transistors that together implement a logic function, and may serve as a basic building block of larger logic circuits. A memory cell may include one or more transistors that together implement a memory function, and may serve as a basic building block of larger circuits. Corresponding logic cells implement the same logic function.

A memory/processing unit may differ from processors, processing integrated circuits, and/or processing units that are manufactured by a second manufacturing process that is better suited to logic cells than to memory cells. Accordingly, the second manufacturing process may be regarded as a logic-class manufacturing process. The second manufacturing process may be used to manufacture central processing units, graphics processing units, and the like.

A memory/processing unit may be better suited than a processor, processing integrated circuit, and/or processing unit to executing less arithmetically intensive operations.

For example, memory cells manufactured by the first manufacturing process may exhibit a critical dimension that exceeds, and even greatly exceeds (for example, by more than 2 times, 3 times, 4 times, 5 times, 6 times, 7 times, 8 times, 9 times, 10 times, and the like), the critical dimension of logic circuits manufactured by the first manufacturing process.

The first manufacturing process may be an analog manufacturing process, the first manufacturing process may be a DRAM manufacturing process, and the like.

The size of a logic cell manufactured by the first manufacturing process may exceed the size of a corresponding logic cell manufactured by the second manufacturing process by at least a factor of two. The corresponding logic cell may have the same functionality as the logic cell manufactured by the first manufacturing process.

The second manufacturing process may be a digital manufacturing process.

The second manufacturing process may be any of a complementary metal-oxide-semiconductor (CMOS), bipolar, bipolar CMOS (BiCMOS), double-diffused metal-oxide-semiconductor (DMOS), or silicon-on-oxide manufacturing process, and the like.

A memory/processing unit may include multiple processor subunits.

The processor subunits of one or more memory/processing units may operate independently of one another, and/or may cooperate with one another and/or perform distributed processing. Distributed processing may be performed in various manners, for example in a flat manner or in a hierarchical manner.

The flat manner may involve having the processor subunits perform the same operations (and processing results may or may not be exchanged between the processor subunits).

The hierarchical manner may involve executing sequences of processing operations at different levels, where the processing operations of one level are performed after the processing operations of another level. Processor subunits may be assigned (dynamically or statically) to different levels and may participate in the hierarchical processing.

Distributed processing may also involve other units, for example a controller of the memory/processing unit and/or units that do not belong to the memory/processing unit.

The terms logic and processor subunit are used interchangeably.

Any processing mentioned in this application may be performed in any manner (distributed and/or non-distributed, and the like).

In the following application, various references are made to, and/or material is incorporated by reference from, PCT patent application publication WO2019025892 and PCT patent application No. PCT/IB2019/001005 (September 9, 2019). PCT patent application publication WO2019025892 and/or PCT patent application No. PCT/IB2019/001005 provide non-limiting examples of various methods, systems, processors, memory chips, and the like. Other methods, systems, and processors may be provided.

A processing system (system) may be provided in which a processor is preceded by one or more memory/processing units, each memory and processing unit (memory/processing unit) having processing resources and storage resources.

The processor may request or instruct the one or more memory/processing units to perform various processing tasks. Executing the various processing tasks may offload the processor, reduce latency, in some cases reduce the total information bandwidth between the one or more memory/processing units and the processor, and the like.

The processor may provide instructions and/or requests at different granularities; for example, the processor may send instructions directed to certain processing resources, or may send higher-level instructions directed to a memory/processing unit without specifying any processing resources.

A memory/processing unit may manage its processing and/or memory resources in any manner (dynamic, static, distributed, centralized, offline, online, and the like). The management of the resources may be performed autonomously, under the control of the processor, after configuration by the processor, and the like.

For example, a task may be partitioned into subtasks, where executing a subtask may require one or more processing resources and/or memory resources of one or more memory/processing units, or one or more instructions. Each processing resource may be configured to execute (for example, independently or not independently) at least one instruction. See, for example, the execution of sub-series of instructions by processing resources such as the processor subunits of PCT patent application publication WO2019025892.

The allocation of memory resources may also be provided, at least in part, to entities other than the one or more memory/processing units, for example to a direct memory access (DMA) unit that may be coupled to the one or more memory/processing units.

A compiler may prepare a configuration file for each type of task executed by the memory/processing unit. The configuration file includes the memory allocation and the processing resource allocation associated with the task type. The configuration file may include instructions that are executable by different processing resources and/or may define memory allocations.

For example, a configuration file related to a matrix multiplication task (multiplying matrix A by matrix B, A*B=C) may indicate where the elements of matrix A are stored, where the elements of matrix B are stored, where the elements of matrix C are stored, and where intermediate results produced during the matrix multiplication are stored, and may include instructions for the processing resources used to execute any mathematical operations related to the matrix multiplication. A configuration file is one example of a data structure; other data structures may be provided.

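A minimal sketch of what such a configuration data structure could look like is shown below; the field names, addresses, and instruction mnemonics are purely hypothetical and are not taken from the source:

```python
# Hypothetical configuration "file" for an A*B=C matrix multiplication task.
# Memory resources identify banks/pads inside a memory/processing unit.
matmul_config = {
    "task_type": "matrix_multiply",
    "operands": {
        "A": {"memory_resource": 0, "base_address": 0x0000, "shape": (4, 4)},
        "B": {"memory_resource": 1, "base_address": 0x0000, "shape": (4, 4)},
        "C": {"memory_resource": 2, "base_address": 0x0000, "shape": (4, 4)},
        "intermediate": {"memory_resource": 3, "base_address": 0x0000},
    },
    # One instruction stream per processing resource (processor subunit).
    "instructions": {
        0: ["load A.row0", "load B", "mac", "store C.row0"],
        1: ["load A.row1", "load B", "mac", "store C.row1"],
        2: ["load A.row2", "load B", "mac", "store C.row2"],
        3: ["load A.row3", "load B", "mac", "store C.row3"],
    },
}
print(matmul_config["instructions"][0])
```

Per the text above, the compiler would prepare one such structure per task type; everything shown here is only an illustration of the idea.
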
Matrix multiplication may be performed by one or more memory/processing units in any manner.

The one or more memory/processing units may multiply matrix A by a vector V. This may be done in any manner. For example, it may involve maintaining one row or column of the matrix per processing resource (each different processing resource maintaining a different row), circulating (between the different processing resources) the results of multiplying the rows or columns of the matrix with the vector (during a first iteration), and circulating the results of the previous multiplications (during the second through last iterations).

Assume that matrix A is a 4×4 matrix, vector V is a 1×4 vector, and there are four processing resources. Under this assumption, the first row of matrix A is stored at a first processor subunit, the second row of matrix A is stored at a second processor subunit, the third row of matrix A is stored at a third processing resource, and the fourth row of matrix A is stored at a fourth processor subunit. The multiplication begins by sending the first through fourth elements of vector V to the first through fourth processing resources, and multiplying the first through fourth elements of vector V by different vectors (rows) of A to provide first intermediate results. The multiplication continues by circulating the first intermediate results: each processing resource sends the first intermediate result it computed to its neighboring processing resource. Each processing resource multiplies the first multiplication result by the vector to provide a second multiplication result. This process is repeated multiple times until the multiplication of matrix A by vector V is complete.

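Because the description of the circulation is terse, the following sketch shows one plausible interpretation (an assumption, not necessarily the exact scheme intended): each of four processing resources holds one row of A and, at any moment, one element of V; on every step it multiplies its row entry by the held element, accumulates locally, and passes the held element to its neighbor:

```python
def ring_matvec(A, V):
    """Ring-based A*V: each of the n processing resources holds one row of A
    and, at any moment, one element of V. On every step a resource multiplies
    its row entry by the V element it currently holds, accumulates locally,
    and then passes that V element on to its neighbor."""
    n = len(V)
    held = list(V)                  # held[p] = V element currently at resource p
    acc = [0.0] * n                 # local partial result at each resource
    for step in range(n):
        for p in range(n):
            j = (p + step) % n      # column of A that matches the held element
            acc[p] += A[p][j] * held[p]
        held = held[1:] + held[:1]  # circulate the V elements around the ring
    return acc

A = [[0.0, 1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0, 7.0],
     [8.0, 9.0, 10.0, 11.0],
     [12.0, 13.0, 14.0, 15.0]]
V = [1.0, 2.0, 3.0, 4.0]
print(ring_matvec(A, V))  # [20.0, 60.0, 100.0, 140.0], the same as A*V
```
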
Figure 90A is an example of a system 10900 that includes one or more memory/processing units (collectively denoted 10910) and a processor 10920. The processor 10920 may send requests or instructions (via link 10931) to the one or more memory/processing units 10910, which in turn complete (or selectively complete) the requests and/or instructions and send the results (via link 10932) to the processor 10920, as explained above. The processor 10920 may further process the results to provide (via link 10933) one or more outputs.

The one or more memory/processing units may include J (J being a positive integer) memory resources 10912(1,1) through 10912(1,J) and K (K being a positive integer) processing resources 10911(1,1) through 10911(1,K).

J may be equal to K or may differ from K.

Processing resources 10911(1,1) through 10911(1,K) may be, for example, processing groups or processor subunits, as described in PCT patent application publication WO2019025892.

Memory resources 10912(1,1) through 10912(1,J) may be memory instances, memory pads, or memory banks, as described in PCT patent application publication WO2019025892.

Any connectivity and/or any functional relationship may exist between any of the resources (memory or processing) of the one or more memory/processing units.

Figure 90B is an example of a memory/processing unit 10910(1).

In Figure 90B, the K (K being a positive integer) processing resources 10911(1,1) through 10911(1,K) form a ring, because the processing resources are connected to one another in series (see link 10915). Each processing resource is also coupled to its own pair of dedicated memory resources (for example, processing resource 10911(1) is coupled to memory resources 10912(1) and 10912(2), and processing resource 10911(K) is coupled to memory resources 10912(J-1) and 10912(J)). The processing resources may be connected to one another in any other manner. The number of memory resources allocated to each processing resource may differ from two. Examples of connectivity between different resources are described in PCT patent application publication WO2019025892.

Figure 90C is an example of a system 10901 with N (N being a positive integer) memory/processing units 10910(1) through 10910(N) and a processor 10920. The processor 10920 may send requests or instructions (via links 10931(1) through 10931(N)) to the memory/processing units 10910(1) through 10910(N), which in turn complete the requests and/or instructions and send the results (via links 10932(1) through 10932(N)) to the processor 10920, as explained above. The processor 10920 may further process the results to provide (via link 10933) one or more outputs.

Figure 90D is an example of a system 10902 that includes N (N being a positive integer) memory/processing units 10910(1) through 10910(N) and a processor 10920. Figure 90D illustrates a preprocessor 10909 that precedes the memory/processing units 10910(1) through 10910(N). The preprocessor may perform various preprocessing operations, such as frame extraction, header detection, and the like.

Figure 90E is an example of a system 10903 that includes one or more memory/processing units 10910 and a processor 10920. Figure 90E illustrates a preprocessor 10909 that precedes the one or more memory/processing units 10910, and a DMA controller 10908.

Figure 90F illustrates a method 10800 for distributed processing of at least one information stream.

Method 10800 may begin with step 10810 of receiving, by one or more memory processing integrated circuits, at least one information stream via a first communication channel, where each memory processing integrated circuit includes a controller, multiple processor subunits, and multiple memory units.

Step 10810 may be followed by steps 10820 and 10830.

Step 10820 may include buffering the information stream by the one or more memory processing integrated circuits.

Step 10830 may include performing, by the one or more memory processing integrated circuits, a first processing operation on the at least one information stream to provide a first processing result.

Step 10830 may involve compression or decompression.

Accordingly, the total size of the information stream may exceed the total size of the first processing result. The total size of the information stream may reflect the amount of information received during a period of a given duration. The total size of the first processing result may reflect the amount of first processing results output during any period of the same given duration.

Alternatively, the total size of the information stream (or of any other information entity mentioned in this specification) may be smaller than the total size of the first processing result; this corresponds to the decompression case.

Step 10830 may be followed by step 10840 of sending the first processing result to one or more processing integrated circuits.

The one or more memory processing integrated circuits may be manufactured by a memory-class manufacturing process.

The one or more memory processing integrated circuits may be manufactured by a logic-class manufacturing process.

In a memory processing integrated circuit, each of the memory units may be coupled to a processor subunit.

Step 10840 may be followed by step 10850 of performing, by the one or more processing integrated circuits, a second processing operation on the first processing result to provide a second processing result.

Step 10820 and/or step 10830 may be instructed by the one or more processing integrated circuits, may be requested by the one or more processing integrated circuits, may be performed after configuration of the one or more memory processing integrated circuits by the one or more processing integrated circuits, or may be performed independently without the involvement of the one or more processing integrated circuits.

The first processing operation may have a lower arithmetic intensity than the second processing operation.

Step 10830 and/or step 10850 may be at least one of: (a) a cellular network processing operation; (b) another network-related processing operation (processing for a network other than a cellular network); (c) a database processing operation; (d) a database analytics processing operation; (e) an artificial intelligence processing operation; or any other processing operation.

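A compact software sketch of this two-stage split (assumed example operations: a low-arithmetic-intensity filtering step near memory followed by a heavier computation on the processing integrated circuit) could be:

```python
def first_processing_op(stream_chunk):
    """Low-arithmetic-intensity stage, here modeled as filtering performed by
    the memory processing integrated circuit (the actual operation could be
    compression, decompression, filtering, and so on)."""
    return [x for x in stream_chunk if x % 2 == 0]  # e.g., keep only "relevant" records

def second_processing_op(first_result):
    """Higher-arithmetic-intensity stage performed by the processing integrated
    circuit on the (typically smaller) first processing result."""
    return sum(x * x for x in first_result)

def process_stream(stream_chunks):
    buffered = list(stream_chunks)                                  # steps 10810/10820
    first_results = [first_processing_op(c) for c in buffered]      # step 10830
    # step 10840: the reduced results are "sent" to the processing integrated circuit
    return [second_processing_op(r) for r in first_results]         # step 10850

print(process_stream([[1, 2, 3, 4], [5, 6, 7, 8, 10]]))  # [20, 200]
```
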
Disaggregated System Memory/Processing Units and Methods for Distributed Processing

A disaggregated system, methods for distributed processing, processing/memory units, methods for operating a disaggregated system, methods for operating a processing/memory unit, and computer-readable media may be provided, the computer-readable media being non-transitory and storing instructions for executing any of these methods. A disaggregated system allocates different subsystems to execute different functions. For example, storage may be implemented mainly in one or more storage subsystems, while computation may be performed mainly in one or more compute subsystems.

A disaggregated system may be a disaggregated server, one or more disaggregated servers, and/or may differ from one or more servers.

A disaggregated system may include one or more switching subsystems, one or more compute subsystems, one or more storage subsystems, and one or more processing/memory subsystems.

The one or more processing/memory subsystems, the one or more compute subsystems, and the one or more storage subsystems are coupled to one another via the one or more switching subsystems.

The one or more processing/memory subsystems may be included in one or more subsystems of the disaggregated system.

Figure 87A illustrates various examples of disaggregated systems.

Any number of subsystems of any type may be provided. A disaggregated system may include one or more additional subsystems of types not included in Figure 87A, may include fewer types of subsystems, and the like.

Disaggregated system 7101 includes two storage subsystems 7130, a compute subsystem 7120, a switching subsystem 7140, and a processing/memory subsystem 7110.

Disaggregated system 7102 includes two storage subsystems 7130, a compute subsystem 7120, a switching subsystem 7140, a processing/memory subsystem 7110, and an accelerator subsystem 7150.

Disaggregated system 7103 includes two storage subsystems 7130, a compute subsystem 7120, and a switching subsystem 7140 that includes a processing/memory subsystem 7110.

Disaggregated system 7104 includes two storage subsystems 7130, a compute subsystem 7120, a switching subsystem 7140 that includes a processing/memory subsystem 7110, and an accelerator subsystem 7150.

Including the processing/memory subsystem 7110 in the switching subsystem 7140 may reduce the traffic within disaggregated systems 7101 and 7102, may reduce switching latency, and the like.

The different subsystems of the disaggregated system may communicate with one another using various communication protocols. It has been found that using Ethernet, and even Ethernet RDMA, communication protocols can increase throughput and may even reduce the complexity of the various control and/or storage operations related to the exchange of information units between components of the disaggregated system.

The disaggregated system may perform distributed processing by allowing the processing/memory subsystem to participate in computations, in particular by executing memory-intensive computations.

For example, assuming that N compute units should share information units among themselves (full sharing among all units), then (a) the N information units may be sent to one or more processing/memory units of the one or more processing/memory subsystems, (b) the one or more processing/memory units may execute the computation that requires the full sharing, and (c) the N updated information units may be sent to the N compute units. This requires approximately N transfer operations.

For example, Figure 87B illustrates distributed processing for updating a model of a neural network (the model including weights assigned to the nodes of the neural network).

Each of the N compute units PU(1) 7120(1) through PU(N) 7120(N) may belong to the compute subsystem 7120 of any of the disaggregated systems 7101, 7102, 7103, and 7104.

The N compute units compute N partial model updates (N different updated portions) 7121(1) through 7121(N) and send them (via switching subsystem 7140) to the processing/memory subsystem 7110.

The processing/memory subsystem 7110 computes the updated model 7122 and sends the updated model (via switching subsystem 7140) to the N compute units PU(1) 7120(1) through PU(N) 7120(N).

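The data flow of Figure 87B resembles a parameter-server style reduction. The sketch below (hypothetical names; element-wise averaging is only an illustrative combination rule) shows why roughly N transfers suffice: each compute unit sends one partial update in and receives one full model back:

```python
def aggregate_model_updates(partial_updates):
    """Role played by the processing/memory subsystem in this sketch: combine
    the N partial model updates (here by simple element-wise averaging; the
    actual combination rule is not specified in the text) into one model."""
    n = len(partial_updates)
    length = len(partial_updates[0])
    return [sum(update[i] for update in partial_updates) / n for i in range(length)]

def distributed_update(compute_unit_updates):
    # (a) N partial updates arrive at the processing/memory subsystem,
    # (b) the subsystem computes the updated model,
    updated_model = aggregate_model_updates(compute_unit_updates)
    # (c) the updated model is sent back to each of the N compute units.
    return [list(updated_model) for _ in compute_unit_updates]

partials = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(distributed_update(partials)[0])  # [3.0, 4.0], received by every compute unit
```
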
Figures 87C, 87D, and 87E illustrate examples of memory/processing units 7011, 7012, and 7013, respectively, and Figures 87F and 87G illustrate integrated circuits 7014 and 7015 that include a memory/processing unit 9010 and one or more communication modules, such as an Ethernet module and an Ethernet RDMA module 7022.

The memory/processing unit includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller is configured to operate as a communication module or may be coupled to a communication module.

The connectivity between the controller 9020 and the pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other configurations (not in pairs).

One or more memory/processing units 9010 of the processing/memory subsystem 7110 may process the model updates in parallel (using different logic, and retrieving different parts of the model in parallel from different memory banks) and, benefiting from the massive memory resources and the very high bandwidth of the connections between the memory banks and the logic, may perform these computations in an efficient manner.

The memory/processing units 7011, 7012, and 7013 of Figures 87C through 87E, and the integrated circuits 7014 and 7015 of Figures 87F and 87G, include one or more communication modules, such as an Ethernet module 7023 (in Figures 87C through 87G) and an Ethernet RDMA module 7022 (in Figures 87E and 87G).

Having such an RDMA and/or Ethernet module (within the memory/processing unit, or within the same integrated circuit as the memory/processing unit) greatly accelerates the communication between the different elements of the disaggregated system and, in the case of RDMA, greatly simplifies the communication between the different elements of the disaggregated system.

It should be noted that a memory/processing unit that includes RDMA and/or Ethernet modules may be beneficial in other environments as well, even when the memory/processing unit is not included in a disaggregated system.

It should also be noted that an RDMA and/or Ethernet module may be allocated per group of memory/processing units, for example for cost-reduction reasons.

It should be noted that memory/processing units, groups of memory/processing units, and even processing/memory subsystems may include other communication ports, for example PCIe communication ports.

Using RDMA and/or Ethernet modules may be cost-effective because it may eliminate the need to connect the memory/processing unit to a bridge that connects to a network interface card (NIC) that may have an Ethernet port.

Using RDMA and/or Ethernet modules may make Ethernet (or Ethernet RDMA) native to the memory/processing unit.

It should be noted that Ethernet is only an example of a local area network (LAN) protocol. PCIe is only an example of another communication protocol that may be used over longer distances than Ethernet.

Figure 87H illustrates a method 7000 for distributed processing.

Method 7000 may include one or more processing iterations.

A processing iteration may be executed by one or more memory processing integrated circuits of the disaggregated system.

A processing iteration may be executed by one or more processing integrated circuits of the disaggregated system.

A processing iteration executed by the one or more memory processing integrated circuits may be followed by a processing iteration executed by the one or more processing integrated circuits.

A processing iteration executed by the one or more memory processing integrated circuits may precede a processing iteration executed by the one or more processing integrated circuits.

Yet another processing iteration may be executed by other circuits of the disaggregated system. For example, one or more preprocessing circuits may perform any type of preprocessing, including preparing the information units for a processing iteration executed by the one or more memory processing integrated circuits.

Method 7000 may include step 7020 of receiving, by one or more memory processing integrated circuits of the disaggregated system, information units.

Each memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.

The one or more memory processing integrated circuits may be manufactured by a memory-class manufacturing process.

The information units may convey portions of a model of a neural network.

The information units may convey partial results of at least one database query.

The information units may convey partial results of at least one aggregate database query.

Step 7020 may include receiving the information units from one or more storage subsystems of the disaggregated system.

Step 7020 may include receiving the information units from one or more compute subsystems of the disaggregated system, where the one or more compute subsystems may include multiple processing integrated circuits manufactured by a logic-class manufacturing process.

Step 7020 may be followed by step 7030 of performing, by the one or more memory processing integrated circuits, a processing operation on the information units to provide processing results.

The total size of the information units may exceed, may equal, or may be smaller than the total size of the processing results.

Step 7030 may be followed by step 7040 of outputting the processing results by the one or more memory processing integrated circuits.

Step 7040 may include outputting the processing results to one or more compute subsystems of the disaggregated system, where the one or more compute subsystems may include multiple processing integrated circuits manufactured by a logic-class manufacturing process.

Step 7040 may include outputting the processing results to one or more storage subsystems of the disaggregated system.

The information units may be sent from different groups of processing units of the multiple processing integrated circuits, and may be different portions of an intermediate result of a process executed in a distributed manner by the multiple processing integrated circuits. A group of processing units may include at least one processing integrated circuit.

Step 7030 may include processing the information units to provide a result of the entire process.

Step 7040 may include sending the result of the entire process to each of the multiple processing integrated circuits.

The different portions of the intermediate result may be different portions of an updated neural network model, and the result of the entire process is the updated neural network model.

Step 7040 may include sending the updated neural network model to each of the multiple processing integrated circuits.

Step 7040 may be followed by step 7050 of performing, by the multiple processing integrated circuits, further processing based at least in part on the processing results sent to the multiple processing integrated circuits.

Step 7040 may include outputting the processing results using a switching subunit of the disaggregated system.

Step 7020 may include receiving information units that are preprocessed information units.

Figure 87I illustrates a method 7001 for distributed processing.

Method 7001 differs from method 7000 in that it includes step 7010 of preprocessing information by the multiple processing integrated circuits to provide the preprocessed information units.

Step 7010 may be followed by steps 7020, 7030, and 7040.

Database Analytics Acceleration

An apparatus, a method, and a computer-readable medium are provided, the computer-readable medium storing instructions for performing at least filtering by a filtering unit that belongs to the same integrated circuit as the memory unit, where the filter may indicate which entries are relevant to a given database query. An arbiter, or any other flow-control manager, may send the relevant entries to the processor and refrain from sending the irrelevant entries to the processor, thereby saving most of the traffic to and from the processor.

See, for example, Figure 91A, which shows a processor (CPU 9240) and an integrated circuit that includes a memory and filtering system 9220. The memory and filtering system 9220 may include filtering units 9224 coupled to memory unit entries 9222 and to one or more arbiters (such as arbiter 9229, which sends the relevant entries to the processor). Any arbitration process may be applied. Any relationship may exist between the number of entries, the number of filtering units, and the number of arbiters.

The arbiter may be replaced by any unit capable of controlling the flow of information, for example a communication interface, a flow controller, and the like.

Reference is made to filtering, which is based on one or more relevance/filtering criteria.

The relevance may be set per database query, and the relevance may be indicated in any manner; for example, the memory unit may store relevance flags 9224' that indicate which entries are relevant. There is also a storage device 9210 that stores K database segments 9220(k), where k ranges between 1 and K. It should be noted that the entire database may be stored in the memory unit rather than in the storage device (this solution is also referred to as a database stored in volatile memory).

The memory unit entries may be too small to store the entire database, and therefore one segment may be received at a time.

The filtering unit may perform filtering operations, such as comparing the value of a field with a threshold, comparing the value of a field with a predefined value, determining whether the value of a field falls within a predefined range, and the like.

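The following sketch (names are hypothetical) shows the kind of lightweight predicate evaluation such a filtering unit performs, marking which entries of the currently loaded segment are relevant so that only those are forwarded to the processor:

```python
def filter_segment(entries, field, predicate):
    """Near-memory filtering: evaluate a simple predicate on one field of every
    entry and return the relevant entries together with a relevance flag per entry."""
    flags = [predicate(entry[field]) for entry in entries]   # relevance flags
    relevant = [e for e, keep in zip(entries, flags) if keep]
    return relevant, flags

segment = [
    {"id": 1, "price": 5.0},
    {"id": 2, "price": 42.0},
    {"id": 3, "price": 17.5},
]
# e.g., a query of the form: SELECT ... WHERE price > 10
relevant_entries, relevance_flags = filter_segment(segment, "price", lambda v: v > 10)
print(relevance_flags)   # [False, True, True]
print(relevant_entries)  # only these would be sent to the CPU by the arbiter
```
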
Accordingly, the filtering unit can perform known database filtering operations and can be a compact and inexpensive circuit.

The final results 9101 of the filtering operation (for example, the contents of the relevant database entries) are sent to the CPU 9240 for processing.

The memory and filtering system 9220 may be replaced by a memory and processing system as illustrated in Figure 91B.

The memory and processing system 9229 includes a processing unit 9225 coupled to the memory unit entries 9222. The processing unit 9225 may perform the filtering operations and may participate, at least in part, in performing one or more additional operations on the relevant records.

The processing unit may be customized to perform specific operations and/or may be a programmable unit configured to perform multiple operations. For example, the processing unit may be a pipelined processing unit, may include an ALU, may include multiple ALUs, and the like.

The processing unit 9225 may perform all of the one or more additional operations.

Alternatively, one part of the one or more additional operations may be performed by the processing unit, and the processor (CPU 9240) may perform another part of the one or more additional operations.

The final results of the processing operations (for example, a partial response 9102 to the database query, or a complete response 9103) are sent to the CPU 9240.

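To make the split between a partial response and a complete response concrete, the sketch below (assumed names and an illustrative average-style aggregate) shows the near-memory processing unit computing per-segment partial aggregates that the CPU then combines into the complete answer:

```python
def near_memory_partial_aggregate(segment, field):
    """Work done next to the memory: a per-segment partial response 9102
    (running sum and count for an average-style aggregate query)."""
    values = [entry[field] for entry in segment]
    return {"sum": sum(values), "count": len(values)}

def cpu_finalize(partials):
    """Work done by the CPU: combine the partial responses into the complete
    response 9103 (here, the overall average)."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count if count else None

segments = [
    [{"price": 10.0}, {"price": 20.0}],
    [{"price": 40.0}],
]
partial_responses = [near_memory_partial_aggregate(s, "price") for s in segments]
print(cpu_finalize(partial_responses))  # 23.333..., the complete response
```
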
A partial response requires further processing.

Figure 92A illustrates a memory/processing system 9228 that includes a memory/processing unit 9227 configured to perform the filtering and the additional processing.

The memory/processing system 9228 implements the processing unit and the memory unit of Figure 91 by means of the memory/processing unit 9227.

The role of the processor may include controlling the processing unit, performing at least a part of the one or more additional operations, and the like.

The combination of memory entries and a processing unit may be implemented, at least in part, by one or more memory/processing units.

Figure 92B illustrates an example memory/processing unit 9010.

The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller is configured to operate as a communication module or may be coupled to a communication module.

The connectivity between the controller 9020 and the pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other configurations (not in pairs). Multiple memory banks may be coupled to, and/or managed by, a single logic.

The memory/processing system receives a database query 9100 via an interface 9211. The interface 9211 may be a bus, a port, an input/output interface, and the like.

It should be noted that a response to a database query may be generated by at least one of (or a combination of one or more of) the following: one or more memory/processing systems, one or more memory and processing systems, one or more memory and screening systems, one or more processors located outside these systems, and the like.

It should be noted that a response to a database query may be generated by at least one of (or a combination of one or more of) the following: one or more screening units, one or more memory/processing units, one or more processing units, one or more other processors (such as one or more other CPUs), and the like.

任何处理程序可包括寻找相关数据库条目,及其处理相关数据库条目。处理可由一个或多个处理实体执行。Any processing procedure may include finding the relevant database entry, and processing the relevant database entry. Processing may be performed by one or more processing entities.

The processing entity may be at least one of the following: a processing unit of a memory and processing system (e.g., processing unit 9225 of memory and processing system 9229), a processor subunit (or logic) of a memory/processing unit, another processor (e.g., CPU 9240 of Figures 91A, 91B, and 74), and the like.

在产生对数据库查询的响应中所涉及的处理可由以下各者中的任一者或以下各者的组合产生:The processing involved in generating a response to a database query may result from any one or a combination of the following:

a.存储器及处理系统9229的处理单元9225。a. Processing unit 9225 of memory and processing system 9229.

b.不同存储器及处理系统9229的处理单元9225。b. Processing unit 9225 of different memory and processing system 9229.

c.存储器/处理系统9228的一个或多个存储器/处理单元9227的处理器子单元(或逻辑9030)。c. The processor sub-unit (or logic 9030) of one or more memory/processing units 9227 of the memory/processing system 9228.

d.不同存储器/处理系统9228的存储器/处理单元9227的处理器子单元(或逻辑9030)。d. The processor sub-unit (or logic 9030) of the memory/processing unit 9227 of the different memory/processing system 9228.

e.存储器/处理系统9228的一个或多个存储器/处理单元9227的控制器。e. A controller of one or more memory/processing units 9227 of the memory/processing system 9228.

f.不同存储器/处理系统9228的一个或多个存储器/处理单元9227的控制器。f. Controllers of one or more memory/processing units 9227 of different memory/processing systems 9228.

Thus, the processing involved in responding to a database query may be performed by a combination or sub-combination of: (a) one or more controllers of one or more memory/processing units, (b) one or more processing units of a memory and processing system, (c) one or more processor subunits of one or more memory/processing units, and (d) one or more other processors, and the like.

由多于一个处理实体执行的处理可被称作分布式处理。Processing performed by more than one processing entity may be referred to as distributed processing.

应注意,筛选可由一个或多个筛选单元和/或一个或多个处理单元和/或一个或多个处理器子单元中的筛选实体执行。在此意义上,执行筛选操作的处理单元和/或处理器子单元可被称作筛选单元。It should be noted that screening may be performed by screening entities in one or more screening units and/or one or more processing units and/or one or more processor sub-units. In this sense, processing units and/or processor sub-units that perform screening operations may be referred to as screening units.

处理实体可为筛选实体或可不同于筛选实体。The processing entity can be a screening entity or can be different from the screening entity.

处理实体可执行由另一筛选实体视为相关的数据库条目的处理操作。A processing entity may perform processing operations that are considered relevant database entries by another filtering entity.

处理实体亦可执行筛选操作。Processing entities can also perform filtering operations.

对数据库查询的响应可利用一个或多个筛选实体及一个或多个处理实体。Responses to database queries may utilize one or more filtering entities and one or more processing entities.

一个或多个筛选实体及一个或多个处理实体可属于同一系统(例如,存储器/处理系统9228、存储器及处理系统9229、存储器及筛选系统9220)或属于不同系统。One or more screening entities and one or more processing entities may belong to the same system (eg, memory/processing system 9228, memory and processing system 9229, memory and screening system 9220) or belong to different systems.

存储器/处理单元可包括多个处理器子单元。处理器子单元可彼此独立地操作,可彼此部分地合作,可参与分布式处理,及其类似者。The memory/processing unit may include multiple processor sub-units. The processor subunits may operate independently of each other, may cooperate in part with each other, may participate in distributed processing, and the like.

图92C说明多个存储器及筛选系统9220、多个其他处理器(诸如,CPU9240)及储存装置9210。92C illustrates a number of memory and screening systems 9220, a number of other processors (such as a CPU 9240), and a storage device 9210.

多个存储器及筛选系统9220可基于多个数据库查询中的一者内的一个或多个筛选准则来参与(同时或不同时)一个或多个数据库条目的筛选。A plurality of memory and screening systems 9220 may participate (simultaneously or not simultaneously) in the screening of one or more database entries based on one or more screening criteria within one of a plurality of database queries.

图92D说明多个存储器及处理系统9229、多个其他处理器(诸如,CPU9240)及储存装置9210。92D illustrates a plurality of memory and processing systems 9229, a plurality of other processors (such as CPU 9240), and storage device 9210.

多个存储器及处理系统9229可参与(同时或不同时)在对多个数据库查询中的一者作出响应中所涉及的筛选及至少部分处理。A plurality of memory and processing systems 9229 may participate (simultaneously or not) in the screening and at least part of the processing involved in responding to one of the plurality of database queries.

图92F说明多个存储器/处理系统9228、多个其他处理器(诸如,CPU9240)及储存装置9210。92F illustrates multiple memory/processing systems 9228, multiple other processors (such as CPU 9240), and storage device 9210.

多个存储器/处理系统9228可参与(同时或不同时)在对多个数据库查询中的一者作出响应中所涉及的筛选及至少部分处理。Multiple memory/processing systems 9228 may participate (simultaneously or not) in the filtering and at least part of the processing involved in responding to one of the multiple database queries.

Figure 92G illustrates a method 9300 for database analysis acceleration.

方法9300可开始于藉由存储器处理集成电路接收数据库查询的步骤9310,该数据库查询包含提示数据库中与数据库查询相关的数据库条目的至少一个相关性准则。The method 9300 may begin at step 9310 of receiving, by the memory processing integrated circuit, a database query including at least one relevance criterion that prompts a database entry in the database that is relevant to the database query.

The database entries that are relevant to the database query may be none of the database entries of the database, or may be one, some, or all of the database entries of the database.

存储器处理集成电路可包括控制器、多个处理器子单元及多个存储器单元。A memory processing integrated circuit may include a controller, multiple processor sub-units, and multiple memory units.

步骤9310之后可接着为藉由存储器处理集成电路且基于至少一个相关性准则而判定储存于存储器处理集成电路中的相关数据库条目的群组的步骤9320。Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on at least one correlation criterion, a group of related database entries stored in the memory processing integrated circuit.

Step 9320 may be followed by step 9330 of sending the group of relevant database entries to one or more processing entities for further processing, while substantially not sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

The phrase "while substantially not sending" means not sending at all (during the response to the database query) or sending only a small number of irrelevant entries. A small number may mean at most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 percent, or sending any amount that does not significantly affect the bandwidth.

步骤9330之后可接着为处理相关数据库条目的群组以提供对数据库查询的响应的步骤9340。Step 9330 may be followed by a step 9340 of processing groups of related database entries to provide responses to database queries.
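The flow of steps 9310 through 9340 can be pictured with the following software-only sketch; the MemoryProcessingIC class and its methods are hypothetical stand-ins for the hardware, and the point is merely that only the relevant group crosses the chip boundary.

```python
# Software model of method 9300 (steps 9310-9340); names are hypothetical.

class MemoryProcessingIC:
    """Stands in for the memory processing integrated circuit."""
    def __init__(self, stored_entries):
        self.stored_entries = stored_entries          # data already resides in-memory

    def filter_relevant(self, is_relevant):
        # Step 9320: determine the group of relevant entries in place.
        return [e for e in self.stored_entries if is_relevant(e)]

def respond_to_query(ic, is_relevant, process):
    group = ic.filter_relevant(is_relevant)           # steps 9310-9320
    # Step 9330: only the relevant group leaves the chip; irrelevant entries do not.
    return process(group)                             # step 9340

ic = MemoryProcessingIC([{"v": i} for i in range(1000)])
answer = respond_to_query(ic,
                          is_relevant=lambda e: e["v"] % 97 == 0,
                          process=lambda group: sum(e["v"] for e in group))
print(answer)
```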

图92H说明用于数据库分析加速的方法9301。Figure 92H illustrates a method 9301 for database analysis acceleration.

假定对数据库查询作出响应所需的筛选及整个处理由存储器处理集成电路执行。It is assumed that the filtering and overall processing required to respond to database queries is performed by the memory processing integrated circuit.

方法9301可开始于藉由存储器处理集成电路接收数据库查询的步骤9310,该数据库查询包含提示数据库中与数据库查询相关的数据库条目的至少一个相关性准则。The method 9301 may begin at step 9310 of receiving, by the memory processing integrated circuit, a database query including at least one relevance criterion that prompts a database entry in the database that is relevant to the database query.

步骤9310之后可接着为藉由存储器处理集成电路且基于至少一个相关性准则而判定储存于存储器处理集成电路中的相关数据库条目的群组的步骤9320。Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on at least one correlation criterion, a group of related database entries stored in the memory processing integrated circuit.

Step 9320 may be followed by step 9331 of sending the group of relevant database entries to one or more processing entities of the memory processing integrated circuit for full processing, while substantially not sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

步骤9331之后可接着为完全处理相关数据库条目的群组以提供对数据库查询的响应的步骤9341。Step 9331 may be followed by step 9341 of fully processing the group of related database entries to provide a response to the database query.

步骤9341之后可接着为从存储器处理集成电路输出对数据库查询的响应的步骤9351。Step 9341 may be followed by step 9351 of outputting a response to the database query from the memory processing integrated circuit.

图92I说明用于数据库分析加速的方法9302。Figure 92I illustrates a method 9302 for database analysis acceleration.

假定对数据库查询作出响应所需的筛选以及处理的仅一部分由存储器处理集成电路执行。存储器处理集成电路将输出部分结果,该些部分结果将由位于存储器处理集成电路外部的一个或多个其他处理实体处理。It is assumed that only a portion of the filtering and processing required to respond to database queries is performed by the memory processing integrated circuit. The memory processing integrated circuit will output partial results, which will be processed by one or more other processing entities external to the memory processing integrated circuit.

Method 9302 may begin at step 9310 of receiving, by the memory processing integrated circuit, a database query that includes at least one relevance criterion indicative of the database entries in the database that are relevant to the database query.

步骤9310之后可接着为藉由存储器处理集成电路且基于至少一个相关性准则而判定储存于存储器处理集成电路中的相关数据库条目的群组的步骤9320。Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on at least one correlation criterion, a group of related database entries stored in the memory processing integrated circuit.

Step 9320 may be followed by step 9332 of sending the group of relevant database entries to one or more processing entities of the memory processing integrated circuit for partial processing, while substantially not sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

步骤9332之后可接着为部分地处理相关数据库条目的群组以提供对数据库查询的中间响应的步骤9342。Step 9332 may be followed by a step 9342 of partially processing the group of related database entries to provide an intermediate response to the database query.

步骤9342之后可接着为从存储器处理集成电路输出对数据库查询的中间响应的步骤9352。Step 9342 may be followed by step 9352 of outputting an intermediate response to the database query from the memory processing integrated circuit.

Step 9352 may be followed by step 9390 of further processing the intermediate response to provide a response to the database query.
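One way to picture the split between steps 9342 and 9390 is a partial aggregate computed inside the memory processing integrated circuit and a final combining step performed outside it. The sketch below assumes a simple averaging query; the function names and data are illustrative only.

```python
# Hypothetical split of work for method 9302: partial processing inside the IC,
# final processing outside it.

def in_memory_partial(bank_entries, is_relevant):
    """Runs inside the IC: filter one bank and return a small intermediate result."""
    values = [e for e in bank_entries if is_relevant(e)]
    return {"partial_sum": sum(values), "count": len(values)}

def host_finalize(intermediate_results):
    """Runs on the external processor: combine intermediate responses (step 9390)."""
    total = sum(r["partial_sum"] for r in intermediate_results)
    count = sum(r["count"] for r in intermediate_results)
    return total / count if count else None   # e.g. an average over relevant entries

banks = [[1, 5, 200, 7], [300, 2, 8], [4, 400]]
partials = [in_memory_partial(b, lambda v: v < 100) for b in banks]   # steps 9320-9342
print(host_finalize(partials))                                        # step 9390
```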

图92J说明用于数据库分析加速的方法9303。Figure 92J illustrates a method 9303 for database analysis acceleration.

假定存储器处理集成电路执行相关数据库条目的筛选,但不执行相关数据库条目的处理。存储器处理集成电路将输出将由位于存储器处理集成电路外部的一个或多个其他处理实体完全处理的相关数据库条目的群组。It is assumed that the memory processing integrated circuit performs the filtering of the relevant database entries, but does not perform the processing of the relevant database entries. The memory processing integrated circuit will output a group of related database entries to be fully processed by one or more other processing entities external to the memory processing integrated circuit.

Method 9303 may begin at step 9310 of receiving, by the memory processing integrated circuit, a database query that includes at least one relevance criterion indicative of the database entries in the database that are relevant to the database query.

步骤9310之后可接着为藉由存储器处理集成电路且基于至少一个相关性准则而判定储存于存储器处理集成电路中的相关数据库条目的群组的步骤9320。Step 9310 may be followed by step 9320 of determining, by the memory processing integrated circuit and based on at least one correlation criterion, a group of related database entries stored in the memory processing integrated circuit.

Step 9320 may be followed by step 9333 of sending the group of relevant database entries to one or more processing entities located outside the memory processing integrated circuit, while substantially not sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

步骤9333之后可接着为完全处理中间响应以提供对数据库的响应的步骤9391。Step 9333 may be followed by step 9391 of fully processing the intermediate response to provide a response to the database.

图92K说明数据库分析加速的方法9304。Figure 92K illustrates a method 9304 for database analysis acceleration.

Method 9304 may begin at step 9315 of receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicative of the database entries in the database that are relevant to the database query; the integrated circuit includes a controller, a screening unit, and multiple memory units.

步骤9315之后可接着为藉由筛选单元且基于至少一个相关性准则来判定储存于集成电路中的相关数据库条目的群组的步骤9325。Step 9315 may be followed by a step 9325 of determining, by a screening unit and based on at least one relevance criterion, a group of related database entries stored in the integrated circuit.

Step 9325 may be followed by step 9335 of sending the group of relevant database entries to one or more processing entities located outside the integrated circuit for further processing, while substantially not sending irrelevant data entries stored in the integrated circuit to the one or more processing entities.

步骤9335之后可接着为步骤9391。Step 9335 may be followed by step 9391.

图92L说明数据库分析加速的方法9305。Figure 92L illustrates a method 9305 for database analysis acceleration.

Method 9305 may begin at step 9314 of receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicative of the database entries in the database that are relevant to the database query; the integrated circuit includes a controller, a screening unit, and multiple memory units.

步骤9314之后可接着为藉由处理单元且基于至少一个相关性准则来判定储存于集成电路中的相关数据库条目的群组的步骤9324。Step 9314 may be followed by step 9324 of determining, by the processing unit, a group of related database entries stored in the integrated circuit based on at least one correlation criterion.

步骤9324之后可接着为藉由处理单元处理相关数据库条目的群组而不藉由处理单元处理储存于集成电路中的不相关数据条目以提供处理结果的步骤9334。Step 9324 may be followed by step 9334 of processing the group of related database entries by the processing unit without processing the unrelated data entries stored in the integrated circuit by the processing unit to provide the processing result.

步骤9334之后可接着为从集成电路输出处理结果的步骤9344。Step 9334 may be followed by step 9344 of outputting the processing results from the integrated circuit.

In any of methods 9300, 9301, 9302, 9304, and 9305, the memory processing integrated circuit outputs an output. The output may be the group of relevant database entries, one or more intermediate results, or one or more (complete) results.

The output may be preceded by retrieving one or more relevant database entries and/or one or more results (complete or intermediate) from the screening entities and/or processing entities of the memory processing integrated circuit.

该撷取可用一或多种方式控制且可由存储器处理集成电路的仲裁器和/或一个或多个控制器控制。The retrieval may be controlled in one or more ways and may be controlled by an arbiter and/or one or more controllers of the memory processing integrated circuit.

The outputting and/or the retrieval may involve one or more parameters that control the retrieval and/or the output. These parameters may include retrieval timing, retrieval rate, retrieval source, retrieval bandwidth or order, output timing, output rate, output source, output bandwidth or order, the type of retrieval method, the type of arbitration method, and the like.

The outputting and/or the retrieval may execute a flow-control process.

The outputting and/or the retrieval (e.g., the application of a flow-control process) may be responsive to indicators, output by the one or more processing entities, of the completion of the processing of the database entries of the group. An indicator may indicate whether an intermediate result is ready to be retrieved from the processing entity.

输出可包括尝试将在输出期间使用的带宽匹配到链路上的最大可允许带宽,该链路将存储器处理集成电路耦接至请求者单元。该链路可为至存储器处理集成电路的输出的接收者的链路。最大可允许带宽可藉由链路的容量和/或可用性、所输出内容的接收者的容量和/或可用性及其类似者规定。The output may include an attempt to match the bandwidth used during the output to the maximum allowable bandwidth on the link coupling the memory processing integrated circuit to the requestor unit. The link may be a link to a recipient of the output of the memory processing integrated circuit. The maximum allowable bandwidth may be specified by the capacity and/or availability of the link, the capacity and/or availability of the recipient of the output content, and the like.

输出可包括尝试以最佳或次最佳方式输出所输出内容。Outputting may include attempting to output the output in an optimal or sub-optimal manner.

所输出内容的输出可包括尝试维持输出业务速率的波动低于阈值。The output of the output content may include an attempt to maintain fluctuations in the output traffic rate below a threshold.
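The kind of traffic shaping described here (matching a link budget while keeping rate fluctuations bounded) is often approximated with a token-bucket pacer; the sketch below is a generic illustration of that idea, not the controller's actual algorithm.

```python
# Generic token-bucket pacer, as one way to keep the output rate near a target
# and its fluctuations bounded; not the actual controller logic.

class TokenBucketPacer:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s      # long-term target rate (link budget)
        self.burst = burst_bytes          # bounds short-term fluctuations
        self.tokens = burst_bytes
        self.last_t = 0.0

    def try_send(self, now, nbytes):
        """Return True if nbytes may be output at time 'now'."""
        self.tokens = min(self.burst, self.tokens + (now - self.last_t) * self.rate)
        self.last_t = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False                      # hold the data until enough budget accrues

pacer = TokenBucketPacer(rate_bytes_per_s=1e9, burst_bytes=64 * 1024)
print(pacer.try_send(0.000, 32 * 1024))   # True  (within the burst allowance)
print(pacer.try_send(0.000, 64 * 1024))   # False (would exceed the shaped rate)
```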

方法9300、9301、9302及9305的任何方法可包括藉由一个或多个处理实体产生处理状态指示符,处理状态指示符可提示相关数据库条目的群组的进一步处理的进展。Any of methods 9300, 9301, 9302, and 9305 may include generating, by one or more processing entities, a processing status indicator that may inform progress of further processing of the group of related database entries.

当包括于上文所提及的方法中的任一者中的处理由多于单个处理实体执行时,则处理可被视为分布式处理,因为处理以分布式方式执行。When the processing included in any of the above-mentioned methods is performed by more than a single processing entity, then the processing may be considered distributed processing because the processing is performed in a distributed fashion.

如上文所提示,可以阶层式方式或以平面方式执行处理。As hinted above, processing can be performed in a hierarchical fashion or in a planar fashion.

方法9300至9305中的任一者可由可同时或依序地对一个或多个数据库查询作出响应的多个系统执行。Any of methods 9300-9305 may be performed by multiple systems that may respond to one or more database queries simultaneously or sequentially.

字嵌入word embedding

如上文所提及,字嵌入(word embedding)为自然语言处理(NLP)中的语言模型化及特征学习技术的集合的总称,其中来自词汇表的字或词组映射至元素的向量。概念上,字嵌入涉及自每个字具有许多维度的空间至具有低得多的维度的连续向量空间的数学嵌入。As mentioned above, word embedding is an umbrella term for a collection of language modeling and feature learning techniques in natural language processing (NLP), where words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, word embeddings involve mathematical embeddings from a space with many dimensions per word to a continuous vector space with much lower dimensions.

可对该些向量进行数学处理。举例而言,可对属于矩阵的向量进行加总以提供加总向量。These vectors can be mathematically processed. For example, vectors belonging to a matrix can be summed to provide a summed vector.

又对于另一实例,可计算(语句的)矩阵的协方差。此可包括将矩阵乘以其转置矩阵。For yet another example, the covariance of the matrix (of the sentence) may be calculated. This can include multiplying the matrix by its transpose.
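In plain NumPy, the two operations mentioned above (summing the word vectors of a sentence, and multiplying the sentence matrix by its transpose) look as follows; the dimensions are arbitrary, and a full covariance estimate would typically also subtract the mean.

```python
import numpy as np

# Sentence of 6 words, each mapped to a 4-dimensional embedding vector (toy sizes).
rng = np.random.default_rng(0)
sentence_matrix = rng.standard_normal((6, 4))   # one row per word vector

summed_vector = sentence_matrix.sum(axis=0)     # sum of the word vectors
gram = sentence_matrix @ sentence_matrix.T      # matrix times its transpose (6 x 6)

print(summed_vector.shape)   # (4,)
print(gram.shape)            # (6, 6)
```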

存储器/处理单元可储存词汇表。特定而言,词汇表的部分可储存于存储器/处理单元的多个存储器组中。The memory/processing unit may store the vocabulary. In particular, portions of the vocabulary may be stored in multiple memory banks of the memory/processing unit.

Thus, the memory/processing unit may be accessed using access information (such as retrieval keys) for the set of words that represent the phrases of a sentence, so that the vectors of the words representing the phrases of the sentence are retrieved from at least some of the memory banks of the memory/processing unit.

存储器/处理单元的不同存储器组可储存词汇表的不同部分,且可被并行地存取(取决于语句的索引的分布)。即使在需要依序存取存储器组的多于单排时,预测亦可减少惩罚。Different memory banks of the memory/processing unit may store different parts of the vocabulary and may be accessed in parallel (depending on the distribution of the indexes of the statement). Prediction reduces penalties even when more than a single row of memory banks needs to be accessed sequentially.

The allocation of the vocabulary words among the different memory banks of the memory/processing unit may be optimized, or at least may be highly beneficial in the sense that it increases the chance of parallel access, per sentence, to different memory banks of the memory/processing unit. The allocation may be learned per user, per general population, or per group of people.
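A simple way to picture the allocation problem is to assign words to banks and count how many accesses of a sentence collide on the same bank (and therefore serialize); the hash-based policy below is only illustrative, and a learned allocation as described above could replace it.

```python
# Illustrative bank-allocation model: words are spread over banks and we count
# how many sequential (conflicting) accesses a sentence would need.
from collections import Counter

NUM_BANKS = 8

def bank_of(word_id: int) -> int:
    # Simple static policy; a learned, per-user allocation could replace this.
    return word_id % NUM_BANKS

def serialized_accesses(sentence_word_ids):
    """Number of extra sequential rounds caused by bank conflicts."""
    per_bank = Counter(bank_of(w) for w in sentence_word_ids)
    return sum(c - 1 for c in per_bank.values())

sentence = [3, 11, 4, 12, 7, 6]           # word indices of one sentence
print(serialized_accesses(sentence))      # 2: banks 3 and 4 are each hit twice
```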

In addition, the memory/processing unit may also be used to perform at least some of the processing operations (by its logic), thereby reducing the bandwidth required on buses external to the memory/processing unit; multiple operations may be computed in an efficient manner, even in parallel (using multiple processors of the memory/processing unit in parallel).

存储器组可与逻辑相关联。Memory groups can be logically associated.

处理操作的至少一部分可由一个或多个额外处理器(诸如,向量处理器,包括但不限于向量加法器)执行。At least a portion of the processing operations may be performed by one or more additional processors, such as vector processors, including but not limited to vector adders.

存储器/处理单元可包括可分配给存储器组(逻辑对)中的一些或全部的一个或多个额外处理器。The memory/processing units may include one or more additional processors that may be assigned to some or all of the memory banks (logical pairs).

因此,可将单个额外处理器分配给存储器组(逻辑对)中的全部或一些。又对于另一实例,该些额外处理器可用阶层式方式配置,使得某一层级的额外处理器处理来自较低层级的额外处理器的输出。Thus, a single additional processor may be assigned to all or some of the memory banks (logical pairs). For yet another example, the additional processors may be configured in a hierarchical manner, such that additional processors at a certain level process output from additional processors at lower levels.

应注意,处理操作可在不使用任何额外处理器的情况下执行,但可由存储器/处理单元的逻辑执行。It should be noted that processing operations may be performed without the use of any additional processors, but may be performed by the logic of the memory/processing unit.

图89A、图89B、图89C、图89D、图89E、图89F及图89G分别说明存储器/处理单元9010、9011、9012、9013、9014、9015及9019的实例。存储器/处理单元9010包括控制器9020、内部总线9021以及多对逻辑9030及存储器组9040。89A, 89B, 89C, 89D, 89E, 89F, and 89G illustrate examples of memory/processing units 9010, 9011, 9012, 9013, 9014, 9015, and 9019, respectively. The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and pairs of logic 9030 and memory banks 9040.

应注意,逻辑9030及存储器组9040可用其他方式耦接至控制器和/或彼此耦接,例如,多个总线可设置于控制器与逻辑之间,逻辑可配置于多个层中,单个逻辑可由多个存储器组共享(参见例如图89E),及其类似者。It should be noted that logic 9030 and memory bank 9040 may be coupled to the controller and/or to each other in other ways, for example, multiple buses may be provided between the controller and logic, logic may be configured in multiple layers, a single logic Can be shared by multiple memory banks (see, eg, Figure 89E), and the like.

The length of a page of each memory bank within the memory/processing unit 9010 may be defined in any manner; for example, it may be small enough, and the number of memory banks may be large enough, to enable a large number of vectors to be output in parallel without wasting many bits on irrelevant information.

The logic 9030 may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like. A partial ALU (or partial memory controller) unit can perform only a portion of the functions that can be performed by a full ALU (or full memory controller). Any logic or subprocessor described in this application may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like.

可用其他方式实施控制器9020与多对逻辑9030及存储器组9040之间的连接性。可用其他方式(不成对)配置存储器组及逻辑。Connectivity between controller 9020 and pairs of logic 9030 and memory banks 9040 may be implemented in other ways. Memory banks and logic can be configured in other ways (not in pairs).

Memory/processing unit 9010 may not have an additional vector processor, in which case the processing of the vectors (from the memory banks) is performed by the logic 9030.

图89B说明额外处理器,诸如耦接至内部总线9021的向量处理器9050。89B illustrates an additional processor, such as a vector processor 9050 coupled to the internal bus 9021.

图89C说明额外处理器,诸如耦接至内部总线9021的向量处理器9050。一个或多个额外处理器执行(单独或与逻辑相配合)处理操作。89C illustrates an additional processor, such as a vector processor 9050 coupled to the internal bus 9021. One or more additional processors perform (alone or in conjunction with logic) the processing operations.

图89D亦说明经由总线9022耦接至存储器/处理单元9010的主机9018。89D also illustrates the host 9018 coupled to the memory/processing unit 9010 via the bus 9022.

Figure 89D also illustrates a vocabulary 9070 that maps words/phrases 9072 to vectors 9073. The memory/processing unit is accessed using retrieval keys 9071, each retrieval key representing a previously recognized word or phrase. The host 9018 sends multiple retrieval keys 9071 that represent a sentence to the memory/processing unit, and the memory/processing unit may output the vectors, or the final result of the processing operations applied to the vectors associated with the sentence. The words/phrases themselves are typically not stored in the memory/processing unit 9010.

The memory-controller functionality for controlling the memory banks may be included (wholly or partially) in the logic, may be included (wholly or partially) in the controller 9020, and/or may be included (wholly or partially) in one or more memory controllers (not shown) within the memory/processing unit 9010.

The memory/processing unit may be configured to maximize the throughput of the vectors/results sent to the host 9018, or may apply any process for controlling traffic inside the memory/processing unit and/or controlling traffic between the memory/processing unit and the host computer (or any other entity external to the memory/processing unit).

不同逻辑9030耦接至存储器/处理单元的存储器组9040,且可对向量执行(较佳并行地)数学运算以产生经处理向量。一个逻辑9030可将向量发送至另一逻辑(参见例如图89G的线38),且另一逻辑可对所接收向量及其计算的向量应用数学运算。逻辑可按层级配置,且某一层级的逻辑可处理来自前一层级逻辑的向量或中间结果(由应用数学运算产生)。该些逻辑可形成树(二元、三元及其类似者)。Various logics 9030 are coupled to memory banks 9040 of memory/processing units and can perform mathematical operations (preferably in parallel) on the vectors to produce processed vectors. One logic 9030 can send the vector to another logic (see, eg, line 38 of Figure 89G), and the other logic can apply mathematical operations on the received vector and its computed vector. Logic may be configured in layers, and logic at one level may process vectors or intermediate results (produced by applying mathematical operations) from logic at a previous level. These logics can form trees (binary, ternary, and the like).

当经处理向量的总大小超过结果的总大小时,则获得输出带宽(在存储器/处理单元外部)的减少。举例而言,当K个向量由存储器/处理单元加总以提供单个输出向量时,则获得带宽的K:1减少。When the total size of the processed vectors exceeds the total size of the result, then a reduction in output bandwidth (outside the memory/processing unit) is obtained. For example, when K vectors are summed by the memory/processing unit to provide a single output vector, then a K:1 reduction in bandwidth is obtained.
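The K:1 reduction is simple arithmetic, but a short sketch makes it concrete: if K vectors are summed inside the unit, only one vector's worth of bytes has to cross the external bus. The sizes below are arbitrary.

```python
# Arbitrary example of the K:1 output-bandwidth reduction from in-unit summation.
K = 16                      # vectors summed inside the memory/processing unit
DIM = 256                   # elements per vector
BYTES_PER_ELEMENT = 2       # e.g. 16-bit values

bytes_without_reduction = K * DIM * BYTES_PER_ELEMENT   # ship all K vectors out
bytes_with_reduction = DIM * BYTES_PER_ELEMENT          # ship only the summed vector

print(bytes_without_reduction, bytes_with_reduction,
      bytes_without_reduction // bytes_with_reduction)  # 8192 512 16  -> a 16:1 reduction
```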

控制器9020可被配置为藉由广播待存取的不同向量的地址来并行地开放多个存储器组。The controller 9020 can be configured to open multiple memory banks in parallel by broadcasting addresses of different vectors to be accessed.

The controller may be configured to control the order of retrieving the different vectors from the multiple memory banks (or from any intermediate buffering or storage circuit that stores the different vectors; see buffer 9033 of Figure 89D), based at least in part on the order of the words or phrases in the sentence.

The controller 9020 may be configured to manage the retrieval of the different vectors based on one or more parameters related to outputting the vectors outside the memory/processing unit 9010; for example, the rate of retrieving the different vectors from the memory banks may be set to be substantially equal to the allowable rate of outputting the different vectors from the memory/processing unit 9010.

控制器可藉由应用任何业务塑形处理程序来在存储器/处理单元9010外部输出不同向量。举例而言,控制器9020可旨在以尽可能接近主计算机或将存储器/处理单元9010耦接至主计算机的链路可允许的最大速率的速率输出不同向量。又对于另一实例,控制器可输出不同向量,同时最少化或至少实质上减少业务速率随时间的波动。The controller can output different vectors outside the memory/processing unit 9010 by applying any traffic shaping handler. For example, the controller 9020 may aim to output the different vectors at a rate as close as possible to the maximum rate allowed by the host computer or the link coupling the memory/processing unit 9010 to the host computer. For yet another example, the controller may output different vectors while minimizing, or at least substantially reducing, fluctuations in traffic rates over time.

The controller 9020 belongs to the same integrated circuit as the memory banks 9040 and the logic 9030, and therefore can easily receive feedback from the different logics/memory banks about the retrieval status of the different vectors (e.g., whether a vector is ready, or whether a vector is ready but another vector is being retrieved, or is about to be retrieved, from the same memory bank), and the like. The feedback may be provided in any manner: via dedicated control lines, via shared control lines, using one or more status bits, and the like (see status lines 9039 of Figure 89F).

控制器9020可独立地控制不同向量的撷取及输出,且因此可减少主计算机的参与。替代地,主计算机可能不知晓控制器的管理能力,且可能继续发送详细指令,且在此状况下,存储器/处理单元9010可忽略详细指令,可隐藏控制器的管理能力,及其类似者。可基于可由主计算机管理的协议使用所提及的以上解决方案。The controller 9020 can independently control the capture and output of different vectors, and thus can reduce the involvement of the host computer. Alternatively, the host computer may be unaware of the management capabilities of the controller and may continue to send detailed instructions, and in this case the memory/processing unit 9010 may ignore the detailed instructions, may hide the management capabilities of the controller, and the like. The above mentioned solutions can be used based on a protocol that can be managed by the host computer.

It has been found that performing processing operations in the memory/processing unit is highly beneficial (in terms of energy), even when these operations consume more power than the corresponding processing operations in the host, and even when they consume more power than the transfer operations between the host and the memory/processing unit. For example, assuming the vectors are large enough, if the energy cost of transferring a data unit is 4 pJ and the energy cost of processing that data unit (by the host) is 0.1 pJ, then processing the data unit in the memory/processing unit is more efficient whenever its energy cost is below 5 pJ.
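The energy argument reduces to a one-line comparison, sketched below with the numbers from the example above; the roughly 5 pJ threshold cited in the text presumably also folds in costs not modeled here (such as returning results), so treat the comparison as illustrative.

```python
# Illustrative break-even check for in-memory processing (numbers from the example above).
TRANSFER_PJ = 4.0        # energy to move one data unit out to the host
HOST_PROCESS_PJ = 0.1    # energy for the host to process that data unit

def in_memory_is_cheaper(in_memory_process_pj: float) -> bool:
    """In-memory processing wins when it costs less than transfer plus host processing."""
    return in_memory_process_pj < TRANSFER_PJ + HOST_PROCESS_PJ

print(in_memory_is_cheaper(0.5))   # True: 0.5 pJ < 4.1 pJ
print(in_memory_is_cheaper(4.0))   # True
print(in_memory_is_cheaper(6.0))   # False
```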

(表示语句的矩阵的)每一向量可由字(或其他多位区段)的序列表示。为解释简单起见,假定多个位区段为字。Each vector (of the matrix representing the statement) may be represented by a sequence of words (or other multi-bit segments). For simplicity of explanation, multiple bit segments are assumed to be words.

Additional power savings may be obtained when a vector includes zero-valued words. Instead of outputting an entire zero-valued word, a zero-value flag that is shorter than the word (for example, a single bit), possibly conveyed over a dedicated control line, may be output instead. Flags may also be assigned to other values (for example, words of value one).
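A flag-based encoding of zero-valued words might look like the following sketch; the exact format (a one-bit flag per word) is made up for illustration and is not specified in the text.

```python
# Made-up flag encoding: zero-valued words are replaced by a 1-bit flag instead of
# a full word, as a rough model of the power/bandwidth saving described above.
WORD_BITS = 16

def encoded_bits(vector):
    bits = 0
    for word in vector:
        if word == 0:
            bits += 1                    # 1-bit "zero" flag instead of a full word
        else:
            bits += 1 + WORD_BITS        # flag bit saying "literal", then the word itself
    return bits

vec = [0, 0, 7, 0, 42, 0, 0, 0]
print(len(vec) * WORD_BITS, encoded_bits(vec))   # 128 raw bits vs 40 encoded bits
```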

图88A说明用于嵌入的方法9400,或确切而言,可为用于撷取特征向量相关信息的方法。特征向量相关信息可包括特征向量和/或处理特征向量的结果。88A illustrates a method 9400 for embedding, or rather, a method for extracting feature vector related information. The feature vector related information may include the feature vector and/or the result of processing the feature vector.

方法9400可开始于藉由存储器处理集成电路接收撷取信息以用于撷取多个所请求特征向量的步骤9410,该些特征向量可映射至多个语句区段。The method 9400 may begin with a step 9410 of receiving, by a memory processing integrated circuit, fetch information for retrieving a plurality of requested feature vectors, which may be mapped to a plurality of sentence segments.

存储器处理单元可包括控制器、多个处理器子单元及多个存储器单元。存储器单元中的每一者可耦接至处理器子单元。A memory processing unit may include a controller, multiple processor sub-units, and multiple memory units. Each of the memory units may be coupled to a processor sub-unit.

步骤9410之后可接着为从多个存储器单元中的至少一些撷取多个所请求特征向量的步骤9420。Step 9410 may be followed by step 9420 of retrieving the plurality of requested feature vectors from at least some of the plurality of memory cells.

该撷取可包括向两个或多于两个存储器单元同时请求储存于该两个或多于两个存储器单元中的所请求特征向量。The fetching may include concurrently requesting two or more memory cells for the requested feature vector stored in the two or more memory cells.

该请求基于语句区段与映射至语句区段的特征向量的位置之间的已知映射而执行。The request is performed based on a known mapping between the statement segments and the locations of the feature vectors mapped to the statement segments.

该映射可在存储器处理集成电路的开机处理程序期间上传。This map may be uploaded during the power-on handler of the memory processing integrated circuit.

一次撷取尽可能多的所请求特征向量可为有益的,但此取决于所请求特征向量储存之处及不同存储器单元的数目。It may be beneficial to fetch as many requested feature vectors as possible at once, but this depends on where the requested feature vectors are stored and on the number of different memory cells.

若多于一个所请求特征向量储存于同一存储器组中,则可应用预测性撷取以用于减少与自存储器组撷取信息相关联的惩罚。在本申请案的各种章节中说明用于减少惩罚的各种方法。If more than one requested feature vector is stored in the same memory bank, predictive fetching may be applied for reducing penalties associated with fetching information from the memory bank. Various methods for reducing penalties are described in various sections of this application.

撷取可包括应用储存于单个存储器单元中的所请求特征向量的集合中的至少一些所请求特征向量的预测性撷取。Fetching may include applying predictive fetching of at least some of the set of requested feature vectors stored in a single memory unit.

所请求特征向量可用最佳方式分布于存储器单元之间。The requested feature vectors can be distributed among the memory cells in an optimal manner.

所请求特征向量可基于预期撷取图案而分布于存储器单元之间。The requested feature vectors may be distributed among the memory cells based on the expected retrieval pattern.

多个所请求特征向量的撷取可根据某一次序执行。举例而言,根据语句区段在一个或多个语句中的次序。The retrieval of the plurality of requested feature vectors may be performed according to a certain order. For example, according to the order of statement sections in one or more statements.

多个所请求特征向量的撷取可至少部分无序地执行;且其中撷取进一步可包括将多个所请求特征向量重新排序。The retrieval of the plurality of requested feature vectors may be performed at least partially out of order; and wherein the retrieval may further include reordering the plurality of requested feature vectors.
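The partially out-of-order retrieval and subsequent reordering can be modeled with a small reorder buffer keyed by the position of each requested vector; this is a generic sketch, not the controller's actual mechanism.

```python
# Generic reorder-buffer sketch: vectors arrive from the banks out of order and are
# released in sentence order.

def reorder(arrivals, total):
    """arrivals yields (position, vector) pairs in completion order."""
    buffer, next_pos, ordered = {}, 0, []
    for pos, vec in arrivals:
        buffer[pos] = vec
        while next_pos in buffer:            # release every vector that is now in order
            ordered.append(buffer.pop(next_pos))
            next_pos += 1
    assert next_pos == total, "some requested vectors never arrived"
    return ordered

arrivals = [(2, "v2"), (0, "v0"), (3, "v3"), (1, "v1")]   # out-of-order completion
print(reorder(arrivals, total=4))                          # ['v0', 'v1', 'v2', 'v3']
```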

多个所请求特征的撷取可包括在多个所请求特征向量可由控制器读取之前缓冲该多个所请求特征向量。The retrieval of the plurality of requested features may include buffering the plurality of requested feature vectors before the plurality of requested feature vectors can be read by the controller.

多个所请求特征的撷取可包括产生缓冲状态指示符,该些缓冲状态指示符提示与多个存储器单元相关联的一个或多个缓冲器何时储存一个或多个所请求特征向量。The retrieval of the plurality of requested features may include generating buffer status indicators that indicate when one or more buffers associated with the plurality of memory cells are storing the one or more requested feature vectors.

该方法可包括经由专用控制线输送缓冲状态指示符。The method may include delivering the buffer status indicator via a dedicated control line.

可每存储器单元分配一个专用控制线。A dedicated control line can be assigned per memory cell.

缓冲状态指示符可为储存于一个或多个缓冲器中的状态位。The buffer status indicators may be status bits stored in one or more buffers.

该方法可包括经由一个或多个共享控制线输送缓冲状态指示符。The method may include conveying the buffer status indicator via one or more shared control lines.

步骤9420之后可接着为处理多个所请求特征向量以提供处理结果的步骤9430。Step 9420 may be followed by step 9430 of processing the plurality of requested feature vectors to provide processing results.

Additionally or alternatively, step 9420 may be followed by step 9440 of outputting, from the memory processing integrated circuit, an output that may include at least one of: (a) the requested feature vectors; and (b) a result of processing the requested feature vectors. At least one of (a) the requested feature vectors and (b) the result of processing the requested feature vectors is also referred to as feature-vector-related information.

当执行步骤9430时,则步骤9440可包括输出(至少)处理所请求特征向量的结果。When step 9430 is performed, then step 9440 may include outputting (at least) the result of processing the requested feature vector.

当跳过步骤9430时,则步骤9440包括输出所请求特征向量且可能不包括输出处理所请求特征向量的结果。When step 9430 is skipped, then step 9440 includes outputting the requested feature vector and may not include outputting the result of processing the requested feature vector.

图88B说明用于嵌入的方法9401。Figure 88B illustrates a method 9401 for embedding.

假定输出包括所请求特征向量,但不包括处理所请求特征向量的结果。The output is assumed to include the requested feature vector, but not the result of processing the requested feature vector.

方法9401可开始于藉由存储器处理集成电路接收撷取信息以用于撷取多个所请求特征向量的步骤9410,该些特征向量可映射至多个语句区段。Method 9401 may begin at step 9410 of receiving, by a memory processing integrated circuit, fetch information for retrieving a plurality of requested feature vectors, which may be mapped to a plurality of sentence segments.

步骤9410之后可接着为从多个存储器单元中的至少一些撷取多个所请求特征向量的步骤9420。Step 9410 may be followed by step 9420 of retrieving the plurality of requested feature vectors from at least some of the plurality of memory cells.

步骤9420之后可接着为从存储器处理集成电路输出包括所请求特征向量但不包括处理所请求特征向量的结果的输出的步骤9431。Step 9420 may be followed by step 9431 of outputting from the memory processing integrated circuit an output that includes the requested feature vector but does not include the result of processing the requested feature vector.

图88C说明用于嵌入的方法9402。Figure 88C illustrates a method 9402 for embedding.

假定输出包括处理所请求特征向量的结果。The output is assumed to include the result of processing the requested feature vector.

方法9402可开始于藉由存储器处理集成电路接收撷取信息以用于撷取多个所请求特征向量的步骤9410,该些特征向量可映射至多个语句区段。The method 9402 may begin with a step 9410 of receiving, by the memory processing integrated circuit, fetch information for retrieving a plurality of requested feature vectors, which may be mapped to a plurality of sentence segments.

步骤9410之后可接着为从多个存储器单元中的至少一些撷取多个所请求特征向量的步骤9420。Step 9410 may be followed by step 9420 of retrieving the plurality of requested feature vectors from at least some of the plurality of memory cells.

步骤9420之后可接着为处理多个所请求特征向量以提供处理结果的步骤9430。Step 9420 may be followed by step 9430 of processing the plurality of requested feature vectors to provide processing results.

Step 9430 may be followed by step 9442 of outputting, from the memory processing integrated circuit, an output that may include the result of processing the requested feature vectors.

The outputting of the output may include applying traffic shaping to the output.

该输出的输出可包括尝试匹配在输出期间使用的带宽与链路上的最大可允许带宽,该链路将存储器处理集成电路耦接至请求者单元。The output of the output may include an attempt to match the bandwidth used during the output with the maximum allowable bandwidth on the link coupling the memory processing integrated circuit to the requestor unit.

该输出的输出可包括尝试维持输出业务速率的波动低于阈值。The output of the output may include an attempt to maintain fluctuations in the output traffic rate below a threshold.

撷取及输出中的任何步骤可在主机的控制下和/或独立地或部分地由控制器执行。Any of the steps of capturing and exporting may be performed under the control of the host and/or independently or in part by the controller.

The host may send retrieval commands of different granularity, ranging from sending general retrieval information regardless of the locations of the requested feature vectors within the multiple memory units, up to sending detailed retrieval information based on the locations of the requested feature vectors within the multiple memory units.

主机可控制(或尝试控制)存储器处理集成电路内的不同撷取操作的时序,但可能与时序无关。The host may control (or attempt to control) the timing of the different fetch operations within the memory processing integrated circuit, but may be independent of timing.

控制器可藉由主机在各种层级中控制,且可甚至忽略主机的详细命令,且独立地至少控制撷取和/或输出。The controller can be controlled at various levels by the host, and can even ignore detailed commands from the host, and independently control at least capture and/or output.

The processing of the requested feature vectors may be performed by at least one of (or a combination of one or more of) the following: one or more memory/processing units, one or more processors located outside the one or more memory/processing units, and the like.

It should be noted that the processing of the requested feature vectors may be performed by at least one of (or a combination of one or more of) the following: one or more processor subunits, a controller, one or more vector processors, and one or more processors located outside the one or more memory/processing units.

所请求特征向量的处理可由以下各者中的任一者或以下各者的组合执行,可由以下各者中的任一者或以下各者的组合产生:The processing of the requested feature vector may be performed by any one or a combination of the following, and may result from any one or a combination of the following:

a.存储器/处理单元的处理器子单元(或逻辑9030)。a. The processor sub-unit (or logic 9030) of the memory/processing unit.

b.多个存储器/处理单元的处理器子单元(或逻辑9030)。b. Processor sub-unit (or logic 9030) of multiple memory/processing units.

c.存储器/处理单元的控制器。c. The controller of the memory/processing unit.

d.多个存储器/处理单元的控制器。d. A controller for multiple memory/processing units.

e.存储器/处理单元的一个或多个向量处理器。e. One or more vector processors of the memory/processing unit.

f.一个或多个向量处理器、多个存储器/处理单元。f. One or more vector processors, multiple memory/processing units.

Accordingly, the processing of the requested feature vectors may be performed by any combination or sub-combination of: (a) one or more controllers of one or more memory/processing units; (b) one or more processor subunits of one or more memory/processing units; (c) one or more vector processors of one or more memory/processing units; and (d) one or more other processors located outside the one or more memory/processing units.

由多于一个处理实体执行的处理可被称作分布式处理。Processing performed by more than one processing entity may be referred to as distributed processing.

存储器/处理单元可包括多个处理器子单元。处理器子单元可彼此独立地操作,可彼此部分地合作,可参与分布式处理,及其类似者。The memory/processing unit may include multiple processor sub-units. The processor subunits may operate independently of each other, may cooperate in part with each other, may participate in distributed processing, and the like.

可用平面方式执行处理,其中所有处理器子单元执行相同操作(且在其之间可能输出或可能不输出处理结果)。Processing may be performed in a planar fashion, where all processor sub-units perform the same operations (and may or may not output processing results between them).

可用阶层式方式执行处理,其中处理涉及不同层级的处理操作序列,而某一层的处理操作在又一层级的处理操作之后。处理器子单元可经分配(动态地或静态地)给不同层且参与阶层式处理。Processing may be performed in a hierarchical fashion, where processing involves a sequence of processing operations at different levels, with processing operations at one level following processing operations at another level. Processor subunits may be assigned (dynamically or statically) to different layers and participate in hierarchical processing.
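The difference between the flat and hierarchical styles can be seen in a small reduction example: in the flat style, every subunit produces the same kind of partial result and a single entity combines them; in the hierarchical style, results are combined level by level. The tree shape below is illustrative only.

```python
# Illustrative flat vs. hierarchical combination of per-subunit partial results.

def flat(partials):
    # Flat: every processor subunit did the same operation; one entity combines them.
    return sum(partials)

def hierarchical(partials):
    # Hierarchical: pairs of results are combined at each level until one remains.
    level = list(partials)
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

partials = [3, 1, 4, 1, 5, 9, 2, 6]       # e.g. per-subunit partial sums
assert flat(partials) == hierarchical(partials) == 31
```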

所请求特征向量的任何处理可由多于一个处理实体(处理器子单元、控制器、向量处理器、其他处理器)执行,可用任何方式(用平面、阶层式或其他方式)进行分布式处理。举例而言,处理器子单元可将其处理结果输出至控制器,该控制器可进一步处理该些结果。位于一个或多个存储器/处理单元外部的一个或多个其他处理器可进一步处理存储器处理集成电路的输出。Any processing of the requested feature vectors may be performed by more than one processing entity (processor subunits, controllers, vector processors, other processors), distributed processing may be performed in any manner (either flat, hierarchical or otherwise). For example, the processor sub-unit can output its processing results to the controller, which can further process the results. One or more other processors external to the one or more memory/processing units may further process the output of the memory processing integrated circuit.

应注意,撷取信息还可包括用于撷取不映射至语句区段的所请求特征向量的信息。这些特征向量可映射至一个或多个人员、装置或可与语句区段相关的任何其他实体。举例而言,感测语句区段的装置的用户、感测区段的装置、识别为语句区段的来源的使用者、在产生语句时存取的网站、俘获语句的位置,及其类似者。It should be noted that the retrieval information may also include information for retrieving requested feature vectors that do not map to sentence segments. These feature vectors can be mapped to one or more persons, devices, or any other entities that can be associated with sentence segments. For example, the user of the device that sensed the sentence segment, the device that sensed the segment, the user identified as the source of the sentence segment, the website accessed when the sentence was generated, the location where the sentence was captured, and the like .

在细节上作必要修改后,方法9400、9401及9402可适用于不映射至语句区段的处理和/或所请求撷取向量。Methods 9400, 9401, and 9402, mutatis mutandis, may be applicable to processing and/or requested retrieving vectors that do not map to sentence segments.

特征向量的处理的非限制性实例可包括加总、加权和、平均、减法或应用任何其他数学函数。Non-limiting examples of processing of feature vectors may include summing, weighted summing, averaging, subtracting, or applying any other mathematical function.

混合装置mixing device

随着处理器速度及存储器大小两者均继续增加,对有效处理速度的显著限制系冯诺依曼(von Neumann)瓶颈。冯诺依曼瓶颈由传统计算机架构所导致的吞吐量限制造成。特定而言,相较于由处理器进行的实际运算,从存储器至处理器的数据传送(在诸如外部DRAM存储器的逻辑晶粒外部)常常遇到瓶颈。因此,用以对存储器进行读取及写入的时钟循环的数目随着存储器密集型处理程序而显著增加。这些时钟循环导致较低的有效处理速度,这是因为对存储器进行读取及写入会消耗时钟循环,该些时钟循环无法用于对数据执行操作。此外,处理器的运算带宽通常大于处理器用以存取存储器的总线的带宽。As both processor speed and memory size continue to increase, a significant limit to effective processing speed is a von Neumann bottleneck. Von Neumann bottlenecks are caused by throughput limitations imposed by traditional computer architectures. In particular, the transfer of data from memory to the processor (outside the logic die such as external DRAM memory) often encounters a bottleneck compared to the actual operations performed by the processor. Consequently, the number of clock cycles used to read and write memory increases significantly with memory-intensive processes. These clock cycles result in lower effective processing speeds because reading and writing the memory consumes clock cycles that cannot be used to perform operations on the data. In addition, the operational bandwidth of the processor is typically greater than the bandwidth of the bus used by the processor to access the memory.

这些瓶颈对于以下各者特别明显:存储器密集型处理程序,诸如神经网络及其他机器学习算法;数据库建构、索引搜寻及查询;以及包括比数据处理操作多的读取及写入操作的其他任务。These bottlenecks are particularly evident for: memory-intensive processing programs, such as neural networks and other machine learning algorithms; database construction, index searches, and queries; and other tasks that include more read and write operations than data processing operations.

本发明描述用于减轻或克服上文所阐述的问题中的一个或多个以及现有技术中的其他问题的解决方案。The present disclosure describes solutions for alleviating or overcoming one or more of the problems set forth above, as well as other problems in the prior art.

A hybrid device for memory-intensive processing may be provided; the hybrid device may include a base die, multiple processors, first memory resources of at least another die, and second memory resources of at least one other die.

该基础晶粒及该至少另一晶粒藉由晶圆上晶圆接合彼此连接。The base die and the at least one other die are connected to each other by wafer-on-wafer bonding.

多个处理器被配置为执行处理操作,且撷取储存于第一存储器资源中的所撷取信息。A plurality of processors are configured to perform processing operations and retrieve the retrieved information stored in the first memory resource.

第二存储器资源被配置为将来自第二存储器资源的额外信息发送至第一存储器资源。The second memory resource is configured to send additional information from the second memory resource to the first memory resource.

The total bandwidth of the first path (between the base die and the die that holds the first memory resources) exceeds the total bandwidth of the second path (between that die and the at least one other die that holds the second memory resources), and the storage capacity of the first memory resources is only a fraction of the storage capacity of the second memory resources.

第二存储器资源为高带宽存储器(HBM)资源。The second memory resource is a high bandwidth memory (HBM) resource.

至少一个其他晶粒为高带宽存储器(HBM)芯片的堆叠。At least one other die is a stack of high bandwidth memory (HBM) chips.

第二存储器资源中的至少一些可属于另一晶粒,该另一晶粒藉由不同于晶圆间接合的连接性而连接至基础晶粒。At least some of the second memory resources may belong to another die that is connected to the base die by connectivity other than inter-wafer bonding.

第二存储器资源中的至少一些属于另一晶粒,该另一晶粒藉由不同于晶圆间接合的连接性而连接至另一晶粒。At least some of the second memory resources belong to another die that is connected to the other die by a different connectivity than the inter-wafer bonding.

第一存储器资源及第二存储器资源为不同层级的高速缓存。The first memory resource and the second memory resource are caches of different levels.

第一存储器资源定位于基础晶粒与第二存储器资源之间。The first memory resource is located between the base die and the second memory resource.

第一存储器资源定位于第二存储器资源的一侧。The first memory resource is located to one side of the second memory resource.

另一晶粒被配置为执行额外处理,其中另一晶粒包含多个处理器子单元及第一存储器资源。Another die is configured to perform additional processing, wherein the other die includes a plurality of processor subunits and a first memory resource.

每一处理器子单元耦接至分配给处理器子单元的第一存储器资源的唯一部分。Each processor subunit is coupled to a unique portion of the first memory resource allocated to the processor subunit.

第一存储器资源的唯一部分为至少一个存储器组。The only part of the first memory resource is at least one memory bank.

The multiple processors are multiple processor subunits included in the die that is a memory processing chip and that includes the first memory resources.

基础晶粒包含多个处理器,其中多个处理器为经由使用晶圆间接合形成的导体耦接至第一存储器资源的多个处理器子单元。The base die includes a plurality of processors, where the plurality of processors are a plurality of processor subunits coupled to a first memory resource via conductors formed using wafer-to-wafer bonding.

每一处理器子单元耦接至分配给处理器子单元的第一存储器资源的唯一部分。Each processor subunit is coupled to a unique portion of the first memory resource allocated to the processor subunit.

A hybrid integrated circuit may be provided that utilizes wafer-on-wafer (WOW) connectivity to couple at least a portion of a base die to second memory resources that are included in one or more other dies and that are connected using connectivity other than the WOW connectivity. An example of the second memory resources is high-bandwidth memory (HBM) resources. In the various figures, the second memory resources are included in a stack of HBM memory units that may be coupled to a controller using through-silicon-via (TSV) connectivity. The controller may be included in the base die or may be coupled (for example, via microbumps) to at least a portion of the base die.

基础晶粒可为逻辑晶粒,但可为存储器/处理单元。The base die can be a logic die, but can be a memory/processing unit.

WOW连接性用以将基础晶粒的一个或多个部分耦接至另一晶粒(WOW连接的晶粒)的一个或多个部分,该另一晶粒可为存储器晶粒或存储器/处理单元。WOW连接性为极高吞吐量连接性。WOW connectivity is used to couple one or more portions of a base die to one or more portions of another die (WOW connected die), which may be a memory die or memory/processing unit. WOW connectivity is very high throughput connectivity.

高带宽存储器(HBM)芯片的堆叠可耦接至基础晶粒(直接或经由WOW连接的晶粒),且可提供高吞吐量连接及极广泛存储器资源。Stacks of high bandwidth memory (HBM) chips can be coupled to the base die (direct or via WOW connected die) and can provide high throughput connections and very broad memory resources.

WOW连接的晶粒可耦接于HBM芯片的堆叠与基础晶粒之间以形成HBM存储器芯片堆叠,该堆叠具有TSV连接性且在其底部具有WOW连接的晶粒。The WOW connected die can be coupled between the stack of HBM chips and the base die to form a HBM memory chip stack with TSV connectivity and the WOW connected die at its bottom.

An HBM chip stack that has TSV connectivity and a WOW-connected die at its bottom may provide a multi-level memory hierarchy, in which the WOW-connected die may serve as a lower-level memory accessible to the base die (for example, a level-3 cache), and fetch and/or prefetch operations from the higher-level HBM memory stack fill the WOW-connected die.
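The role of the WOW-connected die as a small, fast level between the base die and the HBM stack can be modeled as a cache that is filled on demand (or by prefetch) from the larger, slower tier; the sketch below is a generic LRU cache model, not the actual device controller.

```python
# Generic two-tier model: a small WOW-connected tier caching lines of a large HBM tier.
from collections import OrderedDict

class WowCacheTier:
    def __init__(self, capacity_lines, hbm_backing):
        self.capacity = capacity_lines
        self.lines = OrderedDict()           # line_id -> data, kept in LRU order
        self.hbm = hbm_backing               # the much larger, higher-latency tier

    def read(self, line_id):
        if line_id in self.lines:            # hit in the WOW-connected tier
            self.lines.move_to_end(line_id)
            return self.lines[line_id]
        data = self.hbm[line_id]             # miss: fetch (or prefetch) from the HBM stack
        self.lines[line_id] = data
        if len(self.lines) > self.capacity:  # evict the least recently used line
            self.lines.popitem(last=False)
        return data

hbm = {i: f"line-{i}" for i in range(1_000)}
tier = WowCacheTier(capacity_lines=4, hbm_backing=hbm)
for line in (0, 1, 2, 0, 3, 4, 0):
    tier.read(line)
print(list(tier.lines))                      # the four most recently used lines
```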

HBM存储器芯片可为HBM DRAM芯片,但可使用任何其他存储器技术。The HBM memory chips can be HBM DRAM chips, but any other memory technology can be used.

The combination of WOW connectivity and HBM chips makes it possible to provide a multi-level memory structure that may include multiple memory levels offering different trade-offs between bandwidth and memory density.

所建议解决方案可充当传统的DRAM存储器/HBM至逻辑晶粒的内部高速缓存之间的额外的全新存储器阶层,从而在DRAM侧实现更多带宽以及更佳管理及重复使用。The proposed solution can act as an additional new memory tier between traditional DRAM memory/HBM to the logic die's internal cache, enabling more bandwidth and better management and reuse on the DRAM side.

此可在DRAM侧提供以快速方式较佳地管理存储器读取的新的存储器阶层。This can provide a new memory hierarchy on the DRAM side that better manages memory reads in a fast manner.

图93A至图93I分别说明混合集成电路11011'至11019'。93A to 93I illustrate hybrid integrated circuits 11011' to 11019', respectively.

Figure 93A illustrates an HBM DRAM stack with TSV connectivity and microbumps at the lowest level (collectively denoted 11030); the stack includes HBM DRAM memory chips 11032 that are coupled to each other and that are coupled, using TSVs (11039), to a first memory controller 11031 of the base die.

Figure 93A also illustrates a wafer that has at least memory resources and is coupled using WOW technology (collectively denoted 11040); it includes a second memory controller 11022 of the base die 11019, which is coupled to a DRAM die (11021) via one or more WOW intermediate layers (11023). The one or more WOW intermediate layers may be made of different materials, but differ from pad connectivity and/or from TSV connectivity.

穿过一个或多个WOW中间层的导体11022'将DRAM晶粒电耦接至基础晶粒的组件。Conductors 11022' through one or more WOW interlayers electrically couple the DRAM die to components of the base die.

基础晶粒11019耦接至中介层11018,该中介层又使用微凸块耦接至封装基板11017。封装基板在其下表面处具有微凸块的阵列。Base die 11019 is coupled to interposer 11018, which in turn is coupled to package substrate 11017 using microbumps. The package substrate has an array of micro bumps at its lower surface.

微凸块可由其他连接性替换。中介层11018及封装基板11017可由其他层替换。Microbumps can be replaced by other connectivity. The interposer 11018 and the package substrate 11017 may be replaced by other layers.

第一和/或第二存储器控制器(分别为11031及11032)可定位于(至少部分)基础晶粒11019外部,例如定位于DRAM晶圆中,DRAM晶圆与基础晶粒之间,HBM存储器单元的堆叠与基础晶粒之间,及其类似者。The first and/or second memory controllers (11031 and 11032, respectively) may be located (at least in part) outside the base die 11019, such as in the DRAM wafer, between the DRAM wafer and the base die, HBM memory Between the stack of cells and the base die, and the like.

第一和/或第二存储器控制器(分别为11031及11032)可属于同一控制器或可属于不同控制器。The first and/or second memory controllers (11031 and 11032, respectively) may belong to the same controller or may belong to different controllers.

HBM存储器单元中的一个或多个可包括逻辑以及存储器,且可为或可包括存储器/处理单元。One or more of the HBM memory units may include logic as well as memory, and may or may include a memory/processing unit.

第一及第二存储器控制器藉由多个总线11016彼此耦接,以用于在第一存储器资源与第二存储器资源之间输送信息。图93A亦说明自第二存储器控制器至基础晶粒的组件(例如,多个处理器)的总线11014。图93A进一步说明自第一存储器控制器至基础晶粒的组件(例如,多个处理器,如图93C中所展示)的总线11015。The first and second memory controllers are coupled to each other by a plurality of buses 11016 for transferring information between the first memory resource and the second memory resource. Figure 93A also illustrates a bus 11014 from the second memory controller to components of the base die (eg, multiple processors). Figure 93A further illustrates the bus 11015 from the first memory controller to the components of the base die (eg, multiple processors, as shown in Figure 93C).

Figure 93B illustrates a hybrid integrated circuit 11012, which differs from the hybrid integrated circuit 11011 of Figure 93A in that it has a memory/processing unit 11021' instead of the DRAM die 11021.

Figure 93C illustrates a hybrid integrated circuit 11013, which differs from the hybrid integrated circuit 11011 of Figure 93A in that it has an HBM memory chip stack with TSV connectivity and a WOW-connected die at its bottom (collectively denoted 11040); this stack includes a DRAM die 11021 between the stack of HBM memory units and the base die 11018.

The DRAM die 11021 is coupled to the first memory controller 11031 of the base die 11019 using WOW technology (see WOW intermediate layer 11023). One or more of the HBM memory dies 11032 may include logic as well as memory, and may be or may include a memory/processing unit.

The lowermost DRAM die (denoted DRAM die 11021 in Figure 93C) may be an HBM memory die or may differ from the HBM dies. The lowermost DRAM die (DRAM die 11021) may be replaced by a memory/processing unit 11021', as illustrated by the hybrid integrated circuit 11014 of Figure 93D.

Figures 93E to 93G illustrate hybrid integrated circuits 11015, 11016, and 11016', respectively, in which the base die 11019 is coupled to multiple instances of the HBM DRAM stack with TSV connectivity and microbumps at the lowest level (11020) and of the wafer that has at least memory resources and is coupled using WOW technology (11030), and/or to multiple instances of the HBM memory chip stack with TSV connectivity and a WOW-connected die at the bottom (11040).

Figure 93H illustrates a hybrid integrated circuit 11014', which differs from the hybrid integrated circuit 11014 of Figure 93D in that it illustrates a memory unit 11053, a second-level cache (L2 cache 11052), and multiple processors 11051. The multiple processors 11051 are coupled to the L2 cache 11052 and may be fed with coefficients and/or data stored in the memory unit 11053 and the L2 cache 11052.

Any of the hybrid integrated circuits mentioned above may be used for artificial intelligence (AI) processing, which is bandwidth intensive.

When coupled to the memory controller using WOW technology, the memory/processing unit 11021' of Figures 93D and 93H can perform AI computations and can receive both data and coefficients at very high rates from the HBM DRAM stack and/or from the WOW-connected die.

Any memory/processing unit may include a distributed memory array and a processor array. The distributed memory and processor arrays may include multiple memory banks and multiple processors. The multiple processors may form a processing array.

Referring to Figures 93C, 93D, and 93H, assume that the hybrid integrated circuit (11013, 11014, or 11014') is required to perform general matrix-vector multiplications (GEMV), which involve computing the product of a matrix and a vector. This type of computation is bandwidth intensive because there is no reuse of the fetched matrix: the entire matrix needs to be fetched, and it is used only once.

The GEMV may be part of a sequence of mathematical operations that involves (i) multiplying a first matrix (A) by a first vector (V1) to provide a first intermediate vector, and applying a first nonlinear operation (NLO1) to the first intermediate vector to provide a first intermediate result; (ii) multiplying a second matrix (B) by the first intermediate result to provide a second intermediate vector, and applying a second nonlinear operation (NLO2) to the second intermediate vector to provide a second intermediate result; and so on, until an N-th intermediate result is obtained (N may exceed 2).

Assuming each matrix is large (for example, 1 Gb), the computation will require on the order of 1 T operations of compute power and 1 Tb/s of bandwidth/throughput. The fetch operations and the computations can be performed in parallel.

Assume that the GEMV computation has N=4 and takes the form: result = NLO4(D*(NLO3(C*(NLO2(B*(NLO1(A*V1))))))).
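A minimal numerical sketch of such a chain with N=4 is shown below; NumPy is used for the GEMV steps, a ReLU-style function stands in for the nonlinear operations, and the matrix size and choice of nonlinearity are assumptions made only for illustration.

```python
import numpy as np

def relu(x):
    # Placeholder nonlinear operation (NLO); any elementwise function could be used.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
n = 1024                                   # illustrative dimension, far smaller than a 1 Gb matrix
A, B, C, D = (rng.standard_normal((n, n)) for _ in range(4))
v1 = rng.standard_normal(n)

# result = NLO4(D @ NLO3(C @ NLO2(B @ NLO1(A @ v1))))
r1 = relu(A @ v1)        # first GEMV + NLO1
r2 = relu(B @ r1)        # second GEMV + NLO2
r3 = relu(C @ r2)        # third GEMV + NLO3
result = relu(D @ r3)    # fourth GEMV + NLO4
print(result.shape)
```

Because each matrix participates in exactly one product, every byte of A, B, C, and D is fetched once and never reused, which is what makes the chain bandwidth bound rather than compute bound.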

Assume also that the DRAM die 11021 (or the memory/processing unit 11021') does not have enough memory resources to store A, B, C, and D simultaneously; at least some of these matrices will then be stored in the HBM DRAM dies 11032.

Assume that the base die is a logic die that includes computation units such as, but not limited to, processors, arithmetic logic units, and the like.

While the first die computes A*V1, the first memory controller 11031 fetches the missing portions of the other matrices from one or more HBM DRAM dies 11032 for the subsequent computations.

Referring to Figure 93H, assume that (a) the DRAM die 11021 has 2 TB/s of bandwidth and 512 Mb of capacity, (b) the HBM DRAM dies 11032 have 0.2 TB/s of bandwidth and 8 Gb of capacity, and (c) the L2 cache 11052 is an SRAM with 6 TB/s of bandwidth and 10 Mb of capacity.

Matrix multiplication involves data reuse: a large matrix is segmented into multiple segments (for example, 5 Mb segments, sized to fit the L2 cache when it is used in a double-buffer configuration), and a fetched first-matrix segment is multiplied by the segments of the second matrix, one second-matrix segment after another.

While the first-matrix segment is being multiplied by one second-matrix segment, another second-matrix segment is fetched from the DRAM die 11021 (of the memory/processing unit 11021') into the L2 cache.

Assuming the matrices are each 1 Gb, while the fetching and the computation take place, the DRAM die 11021 or the memory/processing unit 11021' is fed with matrix segments from the HBM DRAM dies 11032.

The DRAM die 11021 or the memory/processing unit 11021' aggregates the matrix segments, which are then fed to the base die 11019 via the WOW intermediate layer (11023).
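The double-buffered flow described above can be sketched as follows; the segment size, the explicit copy that stands in for a fetch into faster memory, and the single prefetch thread are illustrative assumptions, not the actual transfer mechanism between the HBM stack, the DRAM die, and the L2 cache.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matmul_double_buffered(a, b, seg):
    """Multiply a @ b segment by segment, prefetching the next segment pair
    while the current one is being used (a toy stand-in for filling the
    L2 cache from the DRAM die while the base die computes)."""
    n = a.shape[1]
    out = np.zeros((a.shape[0], b.shape[1]))
    segments = [(k, min(k + seg, n)) for k in range(0, n, seg)]

    def fetch(bounds):
        k0, k1 = bounds
        return a[:, k0:k1].copy(), b[k0:k1, :].copy()   # "fetch" into fast memory

    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch, segments[0])
        for i, _ in enumerate(segments):
            a_seg, b_seg = pending.result()              # wait for the current segment
            if i + 1 < len(segments):
                pending = prefetcher.submit(fetch, segments[i + 1])  # overlap the next fetch
            out += a_seg @ b_seg                          # compute with the current segment
    return out

rng = np.random.default_rng(1)
a, b = rng.standard_normal((256, 1024)), rng.standard_normal((1024, 256))
assert np.allclose(matmul_double_buffered(a, b, seg=128), a @ b)
```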

The memory/processing unit 11021' can reduce the amount of information sent to the base die 11019 via the WOW intermediate layer (11023) by performing computations and sending the results, rather than sending the intermediate values from which the results are computed. When multiple (Q) intermediate values are processed to provide a result, the compression ratio may be Q to 1.
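For example, if the memory/processing unit accumulates Q partial values and forwards only the final result, a single value crosses the WOW interface instead of Q values; the toy functions below are an assumed illustration of that Q-to-1 reduction.

```python
def send_raw(partials):
    # Without in-memory processing: all Q intermediate values cross the interface.
    return list(partials)

def send_reduced(partials):
    # With in-memory processing: only the final result crosses the interface (Q:1).
    return [sum(partials)]

partials = [0.5, -1.25, 2.0, 0.75]                            # Q = 4 intermediate values
print(len(send_raw(partials)), len(send_reduced(partials)))   # 4 vs 1
```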

Figure 93I illustrates an example of a memory/processing unit 11019' implemented using WOW technology. Logic units 9030 (which may be processor subunits), a controller 9020, and a bus 9021 are located in one chip 111061; the memory banks 9040 allocated to the different logic units are located in a second chip 11062; and the first and second chips are connected to each other by conductors 11012' that pass through a WOW junction 11061, which may include one or more WOW intermediate layers.

Figure 93J is an example of a method 11100 for memory-intensive processing. Memory intensive means that the processing requires, or is associated with, high-bandwidth memory consumption.

Method 11100 may start with steps 11110, 11120, and 11130.

Step 11110 includes performing processing operations by multiple processors of a hybrid device, the hybrid device including a base die, first memory resources of at least another die, and second memory resources of at least one other die, wherein the base die and the at least another die are connected to each other by wafer-on-wafer bonding.

Step 11120 includes retrieving, by the multiple processors, retrieved information stored in the first memory resources.

Step 11130 may include sending additional information from the second memory resources to the first memory resources, wherein the total bandwidth of a first path between the base die and the at least another die exceeds the total bandwidth of a second path between the at least another die and the at least one other die, and wherein the storage capacity of the first memory resources is a fraction of the storage capacity of the second memory resources.

Method 11100 may also include a step 11140 of performing additional processing by the another die, which includes multiple processor subunits and the first memory resources.

Each processor subunit may be coupled to a unique portion of the first memory resources that is allocated to that processor subunit.

The unique portion of the first memory resources is at least one memory bank.

Steps 11110, 11120, 11130, and 11140 may be executed concurrently, in a partially overlapping manner, and the like.

The second memory resources may be high bandwidth memory (HBM) memory resources or may differ from HBM memory resources.

The at least one other die is a stack of high bandwidth memory (HBM) memory chips.

Communication chip

A database includes many entries, and the entries include multiple fields. Database processing typically includes executing one or more queries that include one or more filtering parameters (for example, identifying one or more relevant fields and one or more relevant field values) and also include one or more operation parameters, which may determine the type of operation to be performed, the variables or constants to be used when applying the operation, and the like. The data processing may include database analytics or other database processing procedures.

For example, a database query may request that a statistical operation (operation parameter) be performed on all records of the database in which a certain field has a value within a predefined range (filtering parameter). As yet another example, a database query may request deletion (operation parameter) of records having a certain field that is smaller than a threshold (filtering parameter).
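As a purely illustrative sketch (the record fields, the query format, and the helper logic are assumptions rather than part of the described apparatus), such a query can be represented as a pair of filtering parameters and operation parameters:

```python
records = [
    {"id": 1, "age": 34, "balance": 120.0},
    {"id": 2, "age": 71, "balance": 15.5},
    {"id": 3, "age": 52, "balance": 89.0},
]

# Filtering parameter: keep records whose "age" field falls within a predefined range.
# Operation parameter: compute a statistic (here, a sum) over the "balance" field.
query = {
    "filter": {"field": "age", "range": (30, 60)},
    "operation": {"type": "sum", "field": "balance"},
}

lo, hi = query["filter"]["range"]
relevant = [r for r in records if lo <= r[query["filter"]["field"]] <= hi]
result = sum(r[query["operation"]["field"]] for r in relevant)
print(result)   # 209.0
```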

Large databases are usually stored in storage devices. In order to respond to a query, the database is sent to a memory unit, typically one database segment after another.

The entries of a database segment are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.

For each database segment of the database stored in the memory unit, the processing includes the following steps: (i) selecting the records of the database segment; (ii) sending the records from the memory unit to the processor; (iii) filtering the records by the processor to determine whether the records are relevant; and (iv) performing one or more additional operations (summing, applying any other mathematical and/or statistical operation) on the relevant records.

The filtering procedure ends after all the records have been sent to the processor and the processor has determined which records are relevant.

When the relevant entries of a database segment are not stored in the processor, these relevant records need to be sent to the processor after the filtering stage for further processing (applying the operations that come after the filtering).

When multiple processing operations follow a single filtering, the result of each operation may be sent to the memory unit and then sent to the processor again.

This procedure is bandwidth consuming and time consuming.
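A rough sketch of the conventional flow described above makes this cost explicit; the record format and the byte-count proxy for link traffic are illustrative assumptions.

```python
def process_database(segments, in_range, reduce_fn):
    """Conventional flow: every record of every segment is shipped from the
    memory unit to the processor before filtering, so the link carries the
    whole database even when few records turn out to be relevant."""
    bytes_moved = 0
    relevant = []
    for segment in segments:                 # (i) select a database segment
        for record in segment:               # (ii) send each record to the processor
            bytes_moved += len(str(record))  # crude proxy for link traffic
            if in_range(record):             # (iii) filter on the processor
                relevant.append(record)
    return reduce_fn(relevant), bytes_moved  # (iv) additional operation on relevant records

segments = [[{"v": i + 10 * s} for i in range(4)] for s in range(3)]
total, moved = process_database(segments,
                                lambda r: r["v"] > 15,
                                lambda rs: sum(r["v"] for r in rs))
print(total, moved)
```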

There is a growing need for efficient ways of performing database processing.

An apparatus may be provided that may include a database acceleration integrated circuit.

An apparatus may be provided that may include one or more groups of database acceleration integrated circuits, which may be configured to exchange information and/or acceleration results (the final results of the processing performed by a database acceleration integrated circuit) between the database acceleration integrated circuits of the one or more groups.

The database acceleration integrated circuits of a group may be connected to the same printed circuit board.

The database acceleration integrated circuits of a group may belong to a modular unit of a computerized system.

Database acceleration integrated circuits of different groups may be connected to different printed circuit boards.

Database acceleration integrated circuits of different groups may belong to different modular units of the computerized system.

The apparatus may be configured to execute a distributed processing procedure by the one or more groups of database acceleration integrated circuits.

The apparatus may be configured to use at least one switch for exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different groups of the one or more groups.

The apparatus may be configured to execute a distributed processing procedure by some of the database acceleration integrated circuits of some of the one or more groups.

The apparatus may be configured to execute a distributed processing procedure on first and second data structures, wherein the aggregate size of the first and second data structures exceeds the storage capability of the multiple memory processing integrated circuits.

The apparatus may be configured to execute the distributed processing procedure by performing multiple iterations of: (a) newly allocating different pairs of a first data structure portion and a second data structure portion to different database acceleration integrated circuits; and (b) processing the different pairs.

Figures 94A and 94B illustrate examples of a storage system 11560, a computer system 11150, and one or more devices 11520 for database acceleration. The one or more devices 11520 for database acceleration may monitor the communication between the storage system 11560 and the computer system 11150 in various ways (by listening in, or by being positioned between the computer system 11150 and the storage system 11560).

The storage system 11560 may include many (for example, more than 20, 50, or 100) storage units (such as disks or RAIDs of disks) and may, for example, store more than 100 terabytes of information. The computing system 11510 may be a large computer system and may include tens, hundreds, or even thousands of processing units.

The computing system 11510 may include multiple computing nodes 11512 controlled by a manager 11511.

A computing node may control, or otherwise interact with, the one or more devices 11520 for database acceleration.

The one or more devices 11520 for database acceleration may include one or more database acceleration integrated circuits (see, for example, the database acceleration integrated circuit 11530 of Figures 94A and 94B) and memory resources 11550. The memory resources may belong to one or more chips dedicated to memory, or may belong to a memory/processing unit.

Figures 94C and 94D illustrate examples of the computer system 11150 and the one or more devices 11520 for database acceleration.

The one or more database acceleration integrated circuits of the one or more devices 11520 for database acceleration may be controlled by a management unit 11513, which may be located within the computer system (see Figure 94C) or within the one or more devices 11520 for database acceleration (Figure 94D).

Figure 94E illustrates a device 11520 for database acceleration that includes a database acceleration integrated circuit 11530 and multiple memory processing integrated circuits 11551. Each memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.

The database acceleration integrated circuit 11530 is illustrated as including a network communication interface 11531, a first processing unit 11532, a memory controller 11533, a database acceleration unit 11535, an interconnect 11536, and a management unit 11513.

The network communication interface (11531) may be configured to receive (for example, via first ports 11531(1) of the network communication interface) bulk information from a large number of storage units. Each storage unit may output information at rates exceeding tens and even hundreds of megabytes per second, and data transfer speeds are expected to increase over time (for example, doubling every 2 to 3 years). The number of storage units (a large number) may exceed 10, 50, 100, 200, or even more. The bulk information may exceed tens or hundreds of gigabytes per second, and may even be in the range of terabytes per second.

The first processing unit 11532 may be configured to perform first processing (preprocessing) on the bulk information to provide first processed information.

The memory controller 11533 may be configured to send the first processed information to the multiple memory processing integrated circuits via a high-throughput interface 11534.

The multiple memory processing integrated circuits 11551 may be configured to perform second processing (processing) on at least a portion of the first processed information to provide second processed information.

The memory controller 11533 may be configured to retrieve retrieved information from the multiple memory processing integrated circuits. The retrieved information may include at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information.

The database acceleration unit 11535 may be configured to perform database processing operations on the retrieved information to provide database acceleration results.

The database acceleration integrated circuit may be configured to output the database acceleration results, for example via one or more second ports 11531(2) of the network communication interface.

Figure 94E also illustrates the management unit 11513, which is configured to manage at least one of: the retrieval of the retrieved information, the first processing (preprocessing), the second processing (processing), and third processing (database processing). The management unit 11513 may alternatively be located outside the database acceleration integrated circuit.

The management unit may be configured to perform the management based on an execution plan. The execution plan may be generated by the management unit, or may be generated by an entity located outside the database acceleration integrated circuit. The execution plan may include at least one of: (a) instructions to be executed by the various components of the database acceleration integrated circuit, (b) data and/or coefficients required to implement the execution plan, and (c) memory allocation of instructions and/or data.

The management unit may be configured to perform the management by allocating at least some of: (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) multiple memory processing integrated circuit resources, and (e) database acceleration unit resources.

As illustrated in Figures 94E and 94G, the network communication interface may include network communication ports of different types.

The different types of network communication ports may include storage interface protocol ports (for example, SATA ports, ATA ports, iSCSI ports, network file system ports, and Fibre Channel ports) and general network storage interface protocol ports (for example, ATA over Ethernet, Fibre Channel over Ethernet, NVMe, RoCE, and others).

The different types of network communication ports may include storage interface protocol ports and PCIe ports.

Figure 94F includes dashed lines that illustrate the flow of the bulk information, the first processed information, the retrieved information, and the database acceleration results. Figure 94F illustrates the database acceleration integrated circuit 11530 as being coupled to multiple memory resources 11550. The multiple memory resources 11550 may not belong to memory processing integrated circuits.

The device 11520 for database acceleration may be configured to execute multiple tasks concurrently by means of the database acceleration integrated circuit 11530, because the network communication interface 11531 can receive multiple information streams (concurrently), the first processing unit 11532 can perform the first processing on multiple information units concurrently, the memory controller 11533 can send multiple first processed information units to the multiple memory processing integrated circuits 11551 concurrently, and the database acceleration unit 11535 can process multiple retrieved information units concurrently.

The device 11520 for database acceleration may be configured to execute at least one of the retrieving, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a computing node of a large computing system.

The device 11520 for database acceleration may be configured to manage at least one of the retrieving, the first processing, the sending, and the third processing in a manner that substantially optimizes the utilization of the database acceleration integrated circuit. The optimization takes into account latency, throughput, and any other timing, storage, or processing considerations, and attempts to keep all components along the flow path busy and free of bottlenecks.

The database acceleration integrated circuit may be configured to output the database acceleration results, for example via the one or more second ports 11531(2) of the network communication interface.

The device 11520 for database acceleration may be configured to substantially optimize the bandwidth of the traffic exchanged via the network communication interface.

The device 11520 for database acceleration may be configured to substantially prevent the formation of bottlenecks in at least one of the retrieving, the first processing, the sending, and the third processing, in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

The device 11520 for database acceleration may be configured to allocate the resources of the database acceleration integrated circuit according to the I/O bandwidth over time.

Figure 94G illustrates a device 11520 for database acceleration that includes a database acceleration integrated circuit 11530 and multiple memory processing integrated circuits 11551. Figure 94G also illustrates various units coupled to the database acceleration integrated circuit 11530: a remote RAM 11546, Ethernet memory DIMMs 11547, a storage system 11560, a local storage unit 11561, and a non-volatile memory (NVM) 11563 (the non-volatile memory may be an NVM Express (NVMe) unit).

The database acceleration integrated circuit 11530 is illustrated as including an Ethernet port 11531(1), an RDMA unit 11545, a serial expansion port 11531(15), a SATA controller 11540, a PCIe port 11531(9), the first processing unit 11532, the memory controller 11533, the database acceleration unit 11535, the interconnect 11536, the management unit 11513, a cryptographic engine 11537 for performing cryptographic operations, and a second-level static random access memory (L2 SRAM) 11538.

The database acceleration unit is illustrated as including a DMA engine 11549, a third-level (L3) memory 11548, and database acceleration subunits 11547. The database acceleration subunits 11547 may be configurable units.

The Ethernet port 11531(1), the RDMA unit 11545, the serial expansion port 11531(15), the SATA controller 11540, and the PCIe port 11531(9) may be regarded as parts of the network communication interface 11531.

The remote RAM 11546, the Ethernet memory DIMMs 11547, and the storage system 11560 are coupled to the Ethernet port 11531(1), which in turn is coupled to the RDMA unit 11545.

The local storage unit 11561 is coupled to the SATA controller 11540.

The PCIe port 11531(9) is coupled to the NVM 11563. The PCIe port may also be used for exchanging commands, for example for management purposes.

Figure 94H is an example of the database acceleration unit 11535.

The database acceleration unit 11535 may be configured to execute database processing instructions concurrently by means of database processing subunits 11573; the database acceleration unit may include groups of database accelerator subunits that share a shared memory unit 11575.

Different combinations of the database acceleration subunits 11535 may be dynamically linked to each other (via configurable links or interconnects 11576) to provide the execution pipelines required to execute database processing operations that may include multiple instructions.

Each database processing subunit may be configured to execute a specific type of database processing instruction (for example, filter, merge, accumulate, and the like).

Figure 94H also illustrates an independent database processing unit 11572 coupled to a cache 11571. The database processing unit 11572 and the cache 11571 may be provided instead of, or in addition to, a reconfigurable array 11574 of DB accelerators.

The device may facilitate scale-in and/or scale-out, thereby enabling multiple database acceleration integrated circuits 11530 (and their associated memory resources 11550, or their associated multiple memory processing integrated circuits 11551) to cooperate with each other, for example by participating in distributed processing of database operations.

Figure 94I illustrates a modular unit, such as a blade 11580, that includes two database acceleration integrated circuits 11530 (and their associated memory resources 11550). The blade may include one, two, or more than two memory processing integrated circuits 11551 and their associated memory resources 11550.

The blade may also include one or more non-volatile memory units, an Ethernet switch, a PCIe switch, and an Ethernet switch.

Multiple blades may communicate with each other using any communication method, communication protocol, and connectivity.

Figure 94I illustrates four database acceleration integrated circuits 11530 (and their associated memory resources 11550) that are fully connected to each other; each database acceleration integrated circuit 11530 is connected to all three other database acceleration integrated circuits 11530. The connectivity may use any communication protocol, and may be achieved, for example, by using the RDMA over Ethernet protocol.

Figure 94I also illustrates a database acceleration integrated circuit 11530 that is connected to its associated memory resources 11550 and to a unit 11531 that includes RAM memory and an Ethernet port.

Figures 94J, 94K, 94L, and 94M illustrate four groups 11580 of database acceleration integrated circuits, each group including four database acceleration integrated circuits 11530 (fully connected to each other) and their associated memory resources 11550. The different groups are connected to each other via a switch 11590.

The number of groups may be two, three, or more than four. The number of database acceleration integrated circuits per group may be two, three, or more than four. The number of groups may be the same as (or may differ from) the number of database acceleration integrated circuits per group.

Figure 94K illustrates two tables, A and B, that are too large (for example, 1 terabyte) to be joined efficiently at once.

The tables are effectively segmented into shards, and the join operation is applied to pairs that each include a shard of table A and a shard of table B.

The groups of database acceleration integrated circuits may process the shards in various ways.

For example, the apparatus may be configured to execute the distributed processing procedure by the following operations:

g. allocating different first data structure portions (shards of table A, for example first to sixteenth shards A0 to A15) to different database acceleration integrated circuits of the one or more groups; and

h. performing multiple iterations of: (i) newly allocating different second data structure portions (shards of table B, for example first to sixteenth shards B0 to B15) to different database acceleration integrated circuits of the one or more groups; and (ii) processing the first and second data structure portions by the database acceleration integrated circuits.

The apparatus may be configured to perform the new allocation of the next iteration in a manner that at least partially overlaps in time with the processing of the current iteration.

The apparatus may be configured to perform the new allocation by exchanging second data structure portions between different database acceleration integrated circuits.

The exchange may be performed in a manner that at least partially overlaps in time with the processing procedure.

The apparatus may be configured to perform the new allocation by exchanging second data structure portions between the different database acceleration integrated circuits of a group and, once that exchange has been completed, exchanging second data structure portions between the different groups of database acceleration integrated circuits.

Figure 94K shows four cycles of some of the join operations. For example, with reference to the upper-left database acceleration integrated circuit 11530 of the upper-left group, the four cycles include computing Join(A0, B0), Join(A0, B3), Join(A0, B2), and Join(A0, B1). During these four cycles, A0 remains at the same database acceleration integrated circuit 11530, while the shards of table B (B0, B1, B2, and B3) are rotated among the members of the same group of database acceleration integrated circuits 11530.

In Figure 94L, the shards of the second table are rotated between the different groups: (a) shards B0, B1, B2, and B3 (previously processed by the upper-left group) are sent from the upper-left group to the lower-left group; (b) shards B4, B5, B6, and B7 (previously processed by the lower-left group) are sent from the lower-left group to the upper-right group; (c) shards B8, B9, B10, and B11 (previously processed by the upper-right group) are sent from the upper-right group to the lower-right group; and (d) shards B12, B13, B14, and B15 (previously processed by the lower-right group) are sent from the lower-right group to the upper-left group.
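The intra-group and inter-group rotation described above can be written down as a schedule; the mapping of groups to indices and the ring order used for the inter-group hand-off are illustrative assumptions.

```python
def sharded_join_schedule(num_groups, per_group):
    """Yield (circuit, a_shard, b_shard) triples: each circuit keeps its A shard,
    B shards rotate inside the group on each cycle, and whole B-shard sets are
    handed to the next group once the intra-group rotation completes."""
    n = num_groups * per_group
    a = list(range(n))                                  # A shard pinned to each circuit
    b_sets = [list(range(g * per_group, (g + 1) * per_group)) for g in range(num_groups)]
    for _ in range(num_groups):                         # inter-group rounds
        for step in range(per_group):                   # intra-group rotation cycles
            for g in range(num_groups):
                for k in range(per_group):
                    circuit = g * per_group + k
                    b_shard = b_sets[g][(k - step) % per_group]
                    yield circuit, a[circuit], b_shard  # compute Join(A_i, B_j) here
        # Hand each group's B-shard set to the next group in the ring.
        b_sets = [b_sets[(g - 1) % num_groups] for g in range(num_groups)]

pairs = list(sharded_join_schedule(num_groups=4, per_group=4))
assert len(pairs) == 16 * 16                            # every (A, B) shard pair is visited
assert len({(a, b) for _, a, b in pairs}) == 256        # ...exactly once
```

With four groups of four circuits, circuit 0 sees B0, B3, B2, B1 in its first four cycles, matching the order described for the upper-left circuit; only the B shards move, and after four rounds every circuit has joined its A shard with all sixteen B shards exactly once.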

Figure 94N is an example of a system that includes multiple blades 11580, a SATA controller 11540, a local storage unit 11561, an NVMe 11563, a PCIe switch 11601, Ethernet memory DIMMs 11547, and an Ethernet port 11531(4).

A blade 11580 may be coupled to each of the PCIe switch 11601, the Ethernet port 11531, and the SATA controller 11540.

Figure 94O illustrates two systems, 11621 and 11622.

System 11621 may include one or more devices 11520 for database acceleration, a switching system 11611, a storage system 11612, and a computing system 11613. The switching system 11611 provides connectivity between the one or more devices 11520 for database acceleration, the storage system 11612, and the computing system 11613.

System 11622 may include a storage system with one or more devices for database acceleration 11615, the switching system 11611, and the computing system 11613. The switching system 11611 provides connectivity between the storage system with the one or more devices for database acceleration 11615 and the computing system 11613.

Figure 95A illustrates a method 11200 for database acceleration.

Method 11200 may start with a step 11210 of retrieving bulk information from a large number of storage units by a network communication interface of a database acceleration integrated circuit.

Connecting to a large number of storage units (for example, using multiple different buses) enables the network communication interface to receive bulk information even when a single storage unit has limited throughput.

Step 11210 may be followed by performing first processing on the bulk information to provide first processed information. The first processing may include buffering, extracting information from payloads, removing headers, decompressing, compressing, decrypting, filtering database queries, or performing any other processing operation. The first processing may also be limited to buffering.

Step 11210 may be followed by a step 11220 of sending the first processed information, by a memory controller of the database acceleration integrated circuit and via a high-throughput interface, to multiple memory processing integrated circuits, each of which may include a controller, multiple processor subunits, and multiple memory units. The memory processing integrated circuits may be memory/processing units or distributed processors or memory chips, as described in any other part of this patent application.

Step 11220 may be followed by a step 11230 of performing, by the multiple memory processing integrated circuits, second processing on at least a portion of the first processed information to provide second processed information.

Step 11230 may include executing multiple tasks concurrently by the database acceleration integrated circuit.

Step 11230 may include executing database processing instructions concurrently by database processing subunits, where the database acceleration unit may include groups of database accelerator subunits that share a shared memory unit.

Step 11230 may be followed by a step 11240 of retrieving, by the memory controller of the database acceleration integrated circuit, retrieved information from the multiple memory processing integrated circuits, where the retrieved information may include at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information.

Step 11240 may be followed by a step 11250 of performing, by the database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results.

Step 11250 may include allocating the resources of the database acceleration integrated circuit according to the I/O bandwidth over time.

Step 11250 may be followed by a step 11260 of outputting the database acceleration results.

Step 11260 may include dynamically linking database processing subunits to provide the execution pipelines required to execute database processing operations that may include multiple instructions.

Step 11260 may include outputting the database acceleration results to local storage and retrieving the database acceleration results from local storage.

It should be noted that steps 11210, 11220, 11230, 11240, 11250, and 11260 of method 11200, or any other steps, may be executed in a pipelined manner. These steps may be executed concurrently or in an order that differs from the order mentioned above.
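One way to picture such pipelining is a chain of stages connected by queues, where each stage works on a different chunk of data at the same time; the stage functions below are placeholders (assumptions), not the actual processing performed by the integrated circuit.

```python
from concurrent.futures import ThreadPoolExecutor
import queue

def run_pipeline(chunks, stages):
    """Run each data chunk through the stages in order, with every stage able to
    work on a different chunk at the same time, so that retrieving, preprocessing,
    in-memory processing, and database processing overlap."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:              # end-of-stream marker
                q_out.put(None)
                return
            q_out.put(stage(item))

    with ThreadPoolExecutor(max_workers=len(stages)) as pool:
        for stage, q_in, q_out in zip(stages, qs, qs[1:]):
            pool.submit(worker, stage, q_in, q_out)
        for chunk in chunks:
            qs[0].put(chunk)
        qs[0].put(None)
        results = []
        while (item := qs[-1].get()) is not None:
            results.append(item)
    return results

stages = [
    lambda x: x,                     # stands in for retrieving a chunk (step 11210)
    lambda x: [v * 2 for v in x],    # stands in for first processing and sending (11220)
    lambda x: [v + 1 for v in x],    # stands in for second processing and retrieval (11230/11240)
    lambda x: sum(x),                # stands in for database processing (11250) before output (11260)
]
print(run_pipeline([[1, 2], [3, 4]], stages))   # [8, 16]
```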

For example, step 1120 may be followed by step 11250, so that the first processed information is further processed by the database acceleration unit.

As yet another example, the first processed information may be sent to the multiple memory processing integrated circuits and then sent (without being processed by the multiple memory processing integrated circuits) to the database acceleration unit.

As yet another example, the first processed information and/or the second processed information may be output from the database acceleration integrated circuit without database processing by the database acceleration unit.

The method may include executing at least one of the retrieving, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a computing node of a large computing system.

The method may include managing at least one of the retrieving, the first processing, the sending, and the third processing in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

The method may include substantially optimizing the bandwidth of the traffic exchanged via the network communication interface.

The method may include substantially preventing the formation of bottlenecks in at least one of the retrieving, the first processing, the sending, and the third processing, in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

Method 11200 may also include at least one of the following steps:

Step 11270 may include managing at least one of the retrieving, the first processing, the sending, and the third processing by a management unit of the database acceleration integrated circuit.

The management may be performed based on an execution plan generated by the management unit of the database acceleration integrated circuit.

The management may be performed based on an execution plan that is received by the management unit of the database acceleration integrated circuit rather than generated by the management unit.

The management may include allocating at least some of: (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) multiple memory processing integrated circuit resources, and (e) database acceleration unit resources.

Step 11271 may include controlling at least one of the retrieving, the first processing, the sending, and the third processing by a computing node of a large computing system.

Step 11272 may include managing at least one of the retrieving, the first processing, the sending, and the third processing by a management unit located outside the database acceleration integrated circuit.

Figure 95B illustrates a method 11300 for operating groups of database acceleration integrated circuits.

Method 11300 may start with a step 11310 of performing database acceleration operations by the database acceleration integrated circuits. Step 11310 may include executing one or more steps of method 11200.

Method 11300 may also include a step 11320 of exchanging at least one of (a) information and (b) database acceleration results between the database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

The combination of steps 11310 and 11320 may amount to executing distributed processing by the one or more groups of database acceleration integrated circuits.

The exchange may be performed using the network communication interfaces of the database acceleration integrated circuits of the one or more groups.

The exchange may be performed via multiple groups, which may be connected to each other by a star connection.

Step 11320 may include using at least one switch for exchanging, between database acceleration integrated circuits of different groups of the one or more groups, at least one of: (a) information; and (b) database acceleration results.

Step 11310 may include a step 11311 of executing distributed processing by some of the database acceleration integrated circuits of some of the one or more groups.

Step 11311 may include executing distributed processing of first and second data structures, where the aggregate size of the first and second data structures exceeds the storage capability of the multiple memory processing integrated circuits.

The execution of the distributed processing may include performing multiple iterations of: (a) newly allocating different pairs of a first data structure portion and a second data structure portion to different database acceleration integrated circuits; and (b) processing the different pairs.

The execution of the distributed processing may include executing database join operations.

Step 11310 may include (a) a step 11312 of allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups; and (b) performing multiple iterations of: a step 11314 of newly allocating different second data structure portions to different database acceleration integrated circuits of the one or more groups; and a step 11316 of processing the first and second data structure portions by the database acceleration integrated circuits.

Step 11314 may be performed in a manner that at least partially overlaps in time with the processing of the current iteration.

Step 11314 may include exchanging second data structure portions between different database acceleration integrated circuits.

Step 11320 may be performed in a manner that at least partially overlaps in time with step 11310.

Step 11314 may include exchanging second data structure portions between the different database acceleration integrated circuits of a group and, once that exchange has been completed, exchanging second data structure portions between the different groups of database acceleration integrated circuits.

Figure 95C illustrates a method 11350 for database acceleration.

Method 11350 may include a step 11352 of retrieving bulk information from a large number of storage units by the network communication interface of the database acceleration integrated circuit.

Step 11352 may be followed by a step 11354 of performing first processing on the bulk information to provide first processed information.

Step 11352 may be followed by a step 11354 of sending the first processed information, by the memory controller of the database acceleration integrated circuit and via the high-throughput interface, to multiple memory resources.

Step 11354 may be followed by a step 11356 of retrieving retrieved information from the multiple memory resources.

Step 11356 may be followed by a step 11358 of performing, by the database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results.

Step 11358 may be followed by a step 11359 of outputting the database acceleration results.

The method may also include a step 11355 of performing second processing on the first processed information to provide second processed information. The second processing is executed by multiple processors located in one or more memory processing integrated circuits that further include the multiple memory resources. Step 11355 follows step 11354 and precedes step 11356.

The total size of the second processed information may be smaller than the total size of the first processed information.

The total size of the first processed information may be smaller than the total size of the bulk information.

The first processing may include filtering database entries. Database entries that are irrelevant to a query are thereby filtered out before any other processing is performed, and/or even before the irrelevant database entries are stored in the multiple memory resources, which saves bandwidth, storage resources, and other processing resources.

The second processing may include filtering database entries. Such filtering may be applied when the filtering condition is complex (includes multiple conditions) and multiple database entry fields may need to be received before the filtering can proceed, for example when searching for (a) people over a certain age who like bananas and (b) people over another age who like apples.
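For instance, the two-part condition mentioned above can be expressed as a predicate over several fields of an entry; the field names and thresholds are illustrative assumptions.

```python
def is_relevant(entry, age_1=30, age_2=50):
    # Complex filter: (over age_1 and likes bananas) or (over age_2 and likes apples).
    # Both the "age" field and the "favorite_fruit" field must be available
    # before the condition can be evaluated.
    return ((entry["age"] > age_1 and entry["favorite_fruit"] == "banana")
            or (entry["age"] > age_2 and entry["favorite_fruit"] == "apple"))

entries = [
    {"age": 35, "favorite_fruit": "banana"},
    {"age": 45, "favorite_fruit": "apple"},
    {"age": 60, "favorite_fruit": "apple"},
]
print([is_relevant(e) for e in entries])   # [True, False, True]
```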

数据库database

以下实例可参考数据库。数据库可为数据中心,可为数据中心的部分,或可能不属于数据中心。The following examples can refer to the database. The database may be a data center, may be part of a data center, or may not be part of a data center.

数据库可经由一个或多个网络耦接至多个用户。数据库可为云端数据库。The database may be coupled to multiple users via one or more networks. The database may be a cloud database.

可提供包括一个或多个管理单元及多个数据库加速器板的数据库,该些加速器板包括一个或多个存储器/处理单元。A database may be provided that includes one or more management units and a plurality of database accelerator boards including one or more memory/processing units.

图96B说明数据库12020，该数据库包括管理单元12021及多个DB加速器板12022，该些加速器板各包括通信/管理处理器(处理器12024)及多个存储器/处理单元12026。Figure 96B illustrates a database 12020 that includes a management unit 12021 and a plurality of DB accelerator boards 12022, each of which includes a communication/management processor (processor 12024) and a plurality of memory/processing units 12026.

处理器12024可支持各种通信协议,诸如但不限于PCIe、类似ROCE的协议,及其类似者。The processor 12024 may support various communication protocols such as, but not limited to, PCIe, ROCE-like protocols, and the like.

数据库命令可由存储器/处理单元12026执行,且处理器可在存储器/处理单元12026之间、在不同DB加速器板12022之间且与管理单元12021投送业务。Database commands may be executed by the memory/processing unit 12026, and the processor may route traffic between the memory/processing units 12026, between the different DB accelerator boards 12022, and with the management unit 12021.

尤其在包括大型内部存储器组时,使用多个存储器/处理单元12026可显著加速数据库命令的执行且避免通信瓶颈。The use of multiple memory/processing units 12026 can significantly speed up the execution of database commands and avoid communication bottlenecks, especially when large internal memory banks are included.

图96C说明包括处理器12024及多个存储器/处理单元12026的DB加速器板12022。处理器12024包括多个通信专用组件，诸如用于与存储器/处理单元12026、RDMA引擎12031、DB查询数据库引擎12034及其类似者通信的DDR控制器12033。DDR控制器为通信控制器的实例，且RDMA引擎为任何通信引擎的实例。Figure 96C illustrates a DB accelerator board 12022 that includes a processor 12024 and a plurality of memory/processing units 12026. Processor 12024 includes a number of communication-specific components, such as DDR controller 12033 for communicating with memory/processing unit 12026, RDMA engine 12031, DB query database engine 12034, and the like. A DDR controller is an example of a communication controller, and an RDMA engine is an example of any communication engine.

可提供一种用于操作图96B、图96C及图96D中的任一者的系统(或操作系统的任何部分)的方法。A method for operating the system of any of Figures 96B, 96C, and 96D (or for operating any portion of the system) may be provided.

应注意,数据库加速集成电路11530可与多个存储器资源相关联,该些存储器资源不包括于多个存储器处理集成电路中或以其他方式不与处理单元相关联。在此状况下,处理主要且甚至仅由数据库加速集成电路执行。It should be noted that database acceleration integrated circuit 11530 may be associated with multiple memory resources that are not included in multiple memory processing integrated circuits or otherwise not associated with processing units. In this case, the processing is mainly, and even only, performed by the database acceleration integrated circuit.

图94P说明用于数据库加速的方法11700。Figure 94P illustrates a method 11700 for database acceleration.

方法11700可包括藉由数据库加速集成电路的网络通信接口从储存单元撷取信息的步骤11710。The method 11700 can include the step 11710 of retrieving information from a storage unit via a network communication interface of the database acceleration integrated circuit.

步骤11710之后可接着为对信息量进行第一处理以提供第一经处理信息的步骤11720。Step 11710 may be followed by step 11720 of performing first processing on the information to provide first processed information.

步骤11720之后可接着为藉由数据库加速集成电路的存储器控制器且经由吞吐量接口将第一经处理信息发送至多个存储器资源的步骤11730。Step 11720 may be followed by step 11730 of sending, by a memory controller of the database acceleration integrated circuit and via a throughput interface, the first processed information to the plurality of memory resources.

步骤11730之后可接着为自多个存储器资源撷取信息的步骤11740。Step 11730 may be followed by step 11740 of retrieving information from a plurality of memory resources.

步骤11740之后可接着为藉由数据库加速集成电路的数据库加速单元对所撷取信息执行数据库处理操作以提供数据库加速结果的步骤11750。Step 11740 may be followed by step 11750 of performing database processing operations on the retrieved information by the database acceleration unit of the database acceleration integrated circuit to provide database acceleration results.

步骤11750之后可接着为输出数据库加速结果的步骤11760。Step 11750 may be followed by step 11760 of outputting database acceleration results.

第一处理和/或第二处理可包括筛选数据库条目,判定应进一步处理哪些数据库条目。The first process and/or the second process may include filtering database entries to determine which database entries should be further processed.

第二处理包含筛选数据库条目。The second process involves filtering database entries.

混合系统Hybrid system

存储器/处理单元在执行可为存储器密集的和/或瓶颈与撷取操作相关的计算时可为高效的。当瓶颈与运算操作相关时，面向处理(且较少面向存储器)的处理器单元(诸如但不限于图形处理单元、中央处理单元)可更有效。The memory/processing unit may be efficient in performing computations that are memory intensive and/or whose bottleneck is related to fetch operations. When the bottleneck is related to the computational operations themselves, processing-oriented (and less memory-oriented) processor units (such as, but not limited to, graphics processing units and central processing units) may be more efficient.

混合系统可包括彼此可完全或部分连接的一个或多个处理器单元及一个或多个存储器/处理单元两者。A hybrid system may include both one or more processor units and one or more memory/processing units that are fully or partially connectable to each other.

存储器/处理单元(MPU)可藉由相比逻辑胞元更佳地适合存储器胞元的第一制造制程来制造。举例而言，由第一制造制程制造的记忆胞元可展现相比由第一制造制程制造的逻辑电路的临界尺寸较小且甚至小得多(例如，小超过2倍、3倍、4倍、5倍、6倍、7倍、8倍、9倍、10倍及其类似者)的临界尺寸。举例而言，第一制造制程可为模拟制造制程，第一制造制程可为DRAM制造制程，及其类似者。Memory/processing units (MPUs) may be fabricated by a first fabrication process that is better suited for memory cells than for logic cells. For example, memory cells fabricated by the first fabrication process may exhibit critical dimensions that are smaller, and even much smaller (e.g., more than 2, 3, 4, 5, 6, 7, 8, 9, or 10 times smaller), than the critical dimensions of logic circuits fabricated by the first fabrication process. For example, the first fabrication process may be an analog fabrication process, the first fabrication process may be a DRAM fabrication process, and the like.

处理器可由较佳地适合逻辑的第二制造制程制造。举例而言,由第二制造制程制造的逻辑电路的临界尺寸可比由第一制造制程制造的逻辑电路的临界尺寸小且甚至小得多。又对于另一实例,由第二制造制程制造的逻辑电路的临界尺寸可比由第一制造制程制造的存储器胞元的临界尺寸小且甚至小得多。举例而言,第二制造制程可为模拟制造制程,第二制造制程可为CMOS制造制程,及其类似者。The processor may be fabricated by a second fabrication process that is better suited for logic. For example, the critical dimension of the logic circuit fabricated by the second fabrication process may be smaller and even much smaller than the critical dimension of the logic circuit fabricated by the first fabrication process. As yet another example, the critical dimensions of logic circuits fabricated by the second fabrication process may be smaller and even much smaller than the critical dimensions of memory cells fabricated by the first fabrication process. For example, the second fabrication process may be an analog fabrication process, the second fabrication process may be a CMOS fabrication process, and the like.

可藉由考虑每一单元的益处及与在单元之间传送数据相关的任何惩罚而以静态或动态方式在不同单元之间分配任务。Tasks can be distributed among different units in a static or dynamic manner by taking into account the benefits of each unit and any penalties associated with transferring data between units.

举例而言,可将存储器密集型处理程序分配给存储器/处理单元,而可将处理密集型存储器轻处理分配给处理单元。For example, memory-intensive processing programs may be assigned to memory/processing units, while processing-intensive memory-light processing may be assigned to processing units.
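
A minimal sketch of such a static allocation decision is given below (Python). The cost model, the numeric weights, and the assumption that the MPU executes arithmetic more slowly than the processing-oriented unit are illustrative assumptions only, not a description of the disclosed hardware.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    bytes_touched: int         # amount of data the task reads/writes
    ops_per_byte: float        # arithmetic intensity
    data_resides_in_mpu: bool  # whether the data already lives in the MPU

def assign_unit(task, transfer_cost_per_byte=1.0,
                mpu_cost_per_byte=0.2, proc_cost_per_op=0.01,
                mpu_compute_slowdown=5.0):
    # Cost of running next to the data, inside the memory/processing unit.
    mpu_cost = (task.bytes_touched * mpu_cost_per_byte +
                task.bytes_touched * task.ops_per_byte *
                mpu_compute_slowdown * proc_cost_per_op)
    # Cost on the processing-oriented unit, including the penalty of moving
    # the data out of the MPU when it resides there.
    proc_cost = task.bytes_touched * task.ops_per_byte * proc_cost_per_op
    if task.data_resides_in_mpu:
        proc_cost += task.bytes_touched * transfer_cost_per_byte
    return "memory/processing unit" if mpu_cost <= proc_cost else "processor unit"

print(assign_unit(Task("scan",   1_000_000, 0.5,   True)))  # memory-bound  -> MPU
print(assign_unit(Task("matmul", 1_000_000, 200.0, True)))  # compute-bound -> processor unit
```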

处理器可请求或发指令给一个或多个存储器/处理单元以执行各种处理任务。各种处理任务的执行可减轻处理器的负担,减少潜时,且在一些状况下减少一个或多个存储器/处理单元与处理器之间的总信息带宽,及其类似者。A processor may request or instruct one or more memory/processing units to perform various processing tasks. The execution of various processing tasks may reduce the burden on the processor, reduce latency, and in some cases reduce the overall information bandwidth between one or more memory/processing units and the processor, and the like.

处理器可用不同粒度提供指令和/或请求,例如处理器可发送针对某些处理资源的指令或可发送针对存储器/处理单元的较高阶指令,而不指定任何处理资源。The processor may provide instructions and/or requests at different granularities, eg, the processor may issue instructions for certain processing resources or may issue higher order instructions for memory/processing units without specifying any processing resources.
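
For illustration only, the two granularities might correspond to request payloads such as the following (hypothetical field names and values, shown in Python):

```python
# Fine-grained request: the processor names a specific processing resource
# inside the memory/processing unit.
fine_grained_request = {
    "target": "mpu_12043",
    "processing_resource": "microcontroller_3",
    "operation": "multiply_accumulate",
    "operands": ("bank_7:row_12", "bank_7:row_13"),
}

# Higher-order request: no processing resource is specified; the
# memory/processing unit schedules the work internally.
high_level_request = {
    "target": "mpu_12043",
    "operation": "filter_table",
    "arguments": {"table": "people", "predicate": "age > 30"},
}
```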

图96D为包括一个或多个存储器/处理单元(MPU)12043及处理器12042的混合系统12040的实例。处理器12042可将请求或指令发送至一个或多个MPU 12043，该一个或多个MPU又完成(或选择性地完成)请求和/或指令且将结果发送至处理器12042，如上文所说明。Figure 96D is an example of a hybrid system 12040 that includes one or more memory/processing units (MPUs) 12043 and a processor 12042. The processor 12042 may send requests or instructions to the one or more MPUs 12043, which in turn complete (or selectively complete) the requests and/or instructions and send the results to the processor 12042, as described above.

处理器12042可进一步处理结果以提供一个或多个输出。The processor 12042 can further process the results to provide one or more outputs.

每一MPU包括存储器资源、处理资源(诸如,紧凑微控制器12044)及高速缓存12049。微控制器可具有有限运算能力(例如,可主要包括乘法累加单元)。Each MPU includes memory resources, processing resources (such as compact microcontroller 12044 ), and cache 12049 . Microcontrollers may have limited computing power (eg, may primarily include multiply-accumulate units).

微控制器12044可出于存储器内加速目的而应用处理程序，亦可为CPU或整个DB处理引擎或其子集。The microcontroller 12044 may apply processing programs for in-memory acceleration purposes, or may serve as a CPU or as an entire DB processing engine or a subset thereof.

MPU 12043可包括可用网状、环形或其他拓朴连接以用于快速组间通信的微处理器及封包处理单元。MPU 12043 may include microprocessors and packet processing units that may be connected in a mesh, ring, or other topology for fast inter-group communication.

可存在多于一个DDR控制器以用于快速DIMM间通信。There may be more than one DDR controller for fast inter-DIMM communication.

存储器内封包处理器的目标为减少BW、数据移动、功率消耗,且增加效能。相比标准解决方案,使用存储器内封包处理器将使效能/TCO显著增加。The goals of in-memory packet processors are to reduce BW, data movement, power consumption, and increase performance. Using an in-memory packet processor will result in a significant increase in performance/TCO compared to standard solutions.

应注意，管理单元为可选的。It should be noted that the management unit is optional.

每一MPU可作为人工智能(AI)存储器/处理单元操作，这是因为其可执行AI计算且仅将结果传回至处理器，藉此减少业务量，尤其在MPU接收及储存待用于多个计算中的神经网络系数时，且每次使用神经网络的一部分以处理新数据时不需要自外部芯片接收系数。Each MPU can operate as an artificial intelligence (AI) memory/processing unit because it can perform AI computations and pass only the results back to the processor, thereby reducing traffic, especially when the MPU receives and stores neural network coefficients to be used in multiple computations, so that the coefficients do not need to be received from an external chip each time a portion of the neural network is used to process new data.

MPU可判定系数何时为零,且通知处理器不需要执行包括零值系数的乘法。The MPU can determine when the coefficients are zero, and inform the processor that multiplications involving zero-valued coefficients need not be performed.
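
A minimal sketch of this zero-coefficient optimization is shown below (Python). The data layout and function names are assumptions for illustration; the point is that the MPU, sitting next to the stored coefficients, reports only the non-zero ones, so the processor never issues multiplications by zero.

```python
def mpu_nonzero_coefficients(weights):
    # Runs inside the MPU, next to the stored neural-network coefficients.
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]

def processor_dot_product(activations, nonzero_weights):
    # Runs on the processor; zero-valued coefficients were never reported,
    # so the corresponding multiplications are simply skipped.
    return sum(activations[i] * w for i, w in nonzero_weights)

weights = [0.0, 0.5, 0.0, 0.0, -1.25, 0.0]
activations = [3.0, 2.0, 7.0, 1.0, 4.0, 9.0]
print(processor_dot_product(activations, mpu_nonzero_coefficients(weights)))
# 2.0 * 0.5 + 4.0 * (-1.25) = -4.0, computed with only two multiplications.
```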

应注意,第一处理及第二处理可包括筛选数据库条目。It should be noted that the first process and the second process may include filtering database entries.

MPU可为本说明书中、PCT专利申请案WO2019025862及PCT专利申请案第PCT/IB2019/001005号中的任一者中所说明的任何存储器处理单元。The MPU may be any of the memory processing units described in this specification, in any of PCT patent application WO2019025862 and PCT patent application No. PCT/IB2019/001005.

可提供AI运算系统(及可由系统执行的系统)，其中网络适配器具有AI处理能力且被配置为执行一些AI处理任务，以便减少待经由耦接多个AI加速服务器的网络发送的业务的量。AI computing systems (and methods executable by such systems) may be provided, in which network adapters have AI processing capabilities and are configured to perform some AI processing tasks in order to reduce the amount of traffic to be sent over a network coupling multiple AI acceleration servers.

举例而言，在一些推断系统中，输入为网络(例如，连接至AI服务器的IP相机的多个串流)。在此状况下，在处理及网络连接单元上利用RDMA+AI可减小CPU及PCIe总线的负载且对处理及网络连接单元提供处理，而非由不包括于处理及网络连接单元中的GPU提供处理。For example, in some inference systems the input arrives over the network (e.g., multiple streams from IP cameras connected to the AI server). In this case, utilizing RDMA+AI on the processing and network connection unit can reduce the load on the CPU and the PCIe bus and provide the processing at the processing and network connection unit, rather than having the processing provided by a GPU that is not included in the processing and network connection unit.

举例而言，替代计算初始结果及将初始结果发送至目标AI加速服务器(应用一个或多个AI处理操作)，处理及网络连接单元可执行减少发送至目标AI加速服务器的值的量的预处理。目标AI运算服务器为经分配以对由其他AI加速服务器提供的值执行计算的AI运算服务器。此减少在AI加速服务器之间交换的业务的带宽且亦减小目标AI加速服务器的负载。For example, instead of computing initial results and sending the initial results to a target AI acceleration server (which applies one or more AI processing operations), the processing and network connection unit may perform preprocessing that reduces the amount of values sent to the target AI acceleration server. The target AI computation server is an AI computation server that is assigned to perform computations on values provided by other AI acceleration servers. This reduces the bandwidth of the traffic exchanged between the AI acceleration servers and also reduces the load on the target AI acceleration server.

可藉由使用负载平衡或其他分配算法以动态或静态方式分配目标AI加速服务器。可存在多于单个目标AI加速服务器。Target AI acceleration servers can be allocated dynamically or statically by using load balancing or other allocation algorithms. There may be more than a single target AI acceleration server.

举例而言,若目标AI加速服务器添加了多个损失,则处理及网络连接单元可添加由其AI加速服务器产生的损失且将损失总和发送至目标AI加速服务器,藉此减少带宽。当执行诸如导数计算及聚集以及其类似者的其他预处理操作时,可获得相同益处。For example, if the target AI acceleration server adds multiple losses, the processing and networking unit may add the losses incurred by its AI acceleration server and send the sum of the losses to the target AI acceleration server, thereby reducing bandwidth. The same benefits can be obtained when performing other preprocessing operations such as derivative calculations and aggregation and the like.
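
The loss-aggregation example above can be sketched as follows (Python; the function and field names are hypothetical):

```python
def preprocess_losses(local_losses):
    # Runs on the processing and network connection unit of one AI acceleration
    # server: many per-sample losses are reduced to a single partial sum.
    return {"loss_sum": sum(local_losses), "count": len(local_losses)}

def target_server_combine(partial_results):
    # Runs on the target AI acceleration server: combines the partial sums
    # from all servers, here into a global mean loss.
    total = sum(p["loss_sum"] for p in partial_results)
    count = sum(p["count"] for p in partial_results)
    return total / count

server_a = preprocess_losses([0.9, 1.1, 0.7])  # three values become one message
server_b = preprocess_losses([1.3, 0.5])
print(target_server_combine([server_a, server_b]))  # global mean loss = 0.9
```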

图97B说明包括子系统的系统12060，每一子系统包括用于将具有服务器主板12064的AI处理及网络连接单元12063连接至彼此的交换器12061。服务器主板包括具有网络能力且具有AI处理能力的一个或多个AI处理及网络连接单元12063。AI处理及网络连接单元12063可包括一个或多个NIC及ALU或用于执行预处理的其他计算电路。Figure 97B illustrates a system 12060 that includes subsystems, each including a switch 12061 for connecting to each other AI processing and network connection units 12063 having server motherboards 12064. Each server motherboard includes one or more AI processing and network connection units 12063 with network capability and AI processing capability. The AI processing and network connection unit 12063 may include one or more NICs and ALUs or other computing circuits for performing the preprocessing.

AI处理及网络连接单元12063可为芯片，或可包括多于单个芯片。具有为单个芯片的AI处理及网络连接单元12063可为有益的。The AI processing and network connection unit 12063 may be a chip, or may include more than a single chip. It may be beneficial for the AI processing and network connection unit 12063 to be a single chip.

AI处理及网络连接单元12063可包括(仅或主要)处理资源。AI处理及网络连接单元12063可包括存储器内运算电路,或可不包括存储器内运算电路,或可能不包括海量存储器内运算电路。AI processing and networking unit 12063 may include (only or primarily) processing resources. The AI processing and networking unit 12063 may include in-memory arithmetic circuits, or may not include in-memory arithmetic circuits, or may not include mass-memory arithmetic circuits.

AI处理及网络连接单元12063可为集成电路,可包括多于单个集成电路,可为集成电路的一部分,及其类似者。The AI processing and networking unit 12063 may be an integrated circuit, may include more than a single integrated circuit, may be part of an integrated circuit, and the like.

AI处理及网络连接单元12063可在包括AI处理及网络连接单元12063的AI加速服务器与其他AI加速服务器之间输送(参见例如图97C)业务(例如，藉由使用诸如DDR通道、网络通道和/或PCIe通道的通信端口)。AI处理及网络连接单元12063亦可耦接至诸如DDR存储器的外部存储器。处理及网络连接单元可包括存储器和/或可包括存储器/处理单元。The AI processing and network connection unit 12063 can transport traffic (see, e.g., Figure 97C) between the AI acceleration server that includes the AI processing and network connection unit 12063 and other AI acceleration servers (e.g., by using communication ports such as DDR channels, network channels, and/or PCIe channels). The AI processing and network connection unit 12063 can also be coupled to external memory such as DDR memory. The processing and network connection unit may include memory and/or may include a memory/processing unit.

在图97C中,AI处理及网络连接单元12063经说明为包括本地DDR连接、DDR通道、AI加速器、RAM存储器、加密/解密引擎、PCIe交换器、PCIe接口、多个核心处理阵列、快速网络连接及其类似者。In Figure 97C, AI processing and networking unit 12063 is illustrated as including local DDR connections, DDR channels, AI accelerators, RAM memory, encryption/decryption engines, PCIe switches, PCIe interfaces, multiple core processing arrays, fast network connections and the like.

可提供一种用于操作图97B及图97C中的任一者的系统(或操作系统的任何部分)的方法。A method for operating the system of any of Figures 97B and 97C (or for operating any portion of the system) may be provided.

可提供在本申请案中所提及的任何方法的任何步骤的任何组合。Any combination of any steps of any of the methods mentioned in this application may be provided.

可提供在本申请案中所提及的任何单元、集成电路、存储器资源、逻辑、处理子单元、控制器、组件的任何组合。Any combination of units, integrated circuits, memory resources, logic, processing subunits, controllers, components mentioned in this application may be provided.

对“包括”和/或“包含”的任何参考可在细节上作必要修改后应用于“组成”、“实质上组成”。Any reference to "comprising" and/or "including" applies, mutatis mutandis, to "consisting of" and "consisting essentially of".

已出于说明的目的呈现先前描述。先前描述并不详尽且不限于所公开的精确形式或实施例。从本说明书的考虑及所公开实施例的实践，修改及调适对本领域技术人员将为显而易见的。另外，尽管所公开实施例的方面描述为储存于存储器中，但本领域技术人员将了解，这些方面也可储存于其他类型的计算机可读介质上，诸如次要储存设备，例如硬盘或CD ROM，或其他形式的RAM或ROM、USB媒体、DVD、蓝光、4K超HD蓝光，或其他光驱介质。The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of this specification and practice of the disclosed embodiments. Additionally, although aspects of the disclosed embodiments are described as being stored in memory, those skilled in the art will appreciate that these aspects can also be stored on other types of computer-readable media, such as secondary storage devices, for example hard disks or CD ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, 4K Ultra HD Blu-ray, or other optical drive media.

基于书面描述及所公开方法的计算机程序在有经验开发者的技能范围内。可使用本领域技术人员已知的技术中的任一者产生或可结合现有软件设计各种程序或程序模块。例如，程序区段或程序模块可用或藉助于.Net Framework、.Net Compact Framework(及相关语言，诸如Visual Basic、C等)、Java、C++、Objective-C、HTML、HTML/AJAX组合、XML或包括Java小程序的HTML来设计。Computer programs based on the written description and the disclosed methods are within the skill of an experienced developer. The various programs or program modules can be created using any of the techniques known to those skilled in the art or can be designed in connection with existing software. For example, program sections or program modules can be designed in or by means of .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/AJAX combinations, XML, or HTML with included Java applets.

此外，尽管本文已经描述了说明性实施例，但是本领域技术人员基于本公开将了解具有等效组件、修改、省略、组合(例如，跨各种实施例的方面的组合)、调适和/或更改的任何及所有实施例的范畴。权利要求中的限制应基于权利要求中所使用的语言来广泛地解释，且不限于本说明书中所描述或在本申请的审查期间的实施例。实施例应解释为非排他性的。此外，所公开方法的步骤可用任何方式进行修改，包括通过对步骤重排序和/或插入或删除步骤。因此，本说明书及实施例仅被认为是说明性的，其中真实的范围和精神由所附权利要求及其等同物的全部范围指示。Furthermore, although illustrative embodiments have been described herein, those skilled in the art will, based on this disclosure, appreciate the scope of any and all embodiments having equivalent components, modifications, omissions, combinations (e.g., combinations of aspects across the various embodiments), adaptations, and/or alterations. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the embodiments described in this specification or during the prosecution of the application. The embodiments are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as illustrative only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.

Claims (368)

1.一种集成电路,其包含:1. An integrated circuit comprising: 基板;substrate; 存储器阵列,其安置于所述基板上,所述存储器阵列包括多个离散存储器组;a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; 处理阵列,其安置于所述基板上,所述处理阵列包括多个处理器子单元,所述多个处理器子单元中的每一者与所述多个离散存储器组当中的一个或多个离散存储器组相关联;及a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits and one or more of the plurality of discrete memory banks Discrete memory bank associations; and 控制器,其被配置为:Controller, which is configured as: 相对于所述集成电路的操作实施至少一个安全措施。At least one security measure is implemented with respect to the operation of the integrated circuit. 2.如权利要求1所述的集成电路,其中所述控制器被配置为在所述至少一个安全措施被触发的情况下采取一个或多个补救动作。2. The integrated circuit of claim 1, wherein the controller is configured to take one or more remedial actions if the at least one safety measure is triggered. 3.如权利要求1所述的集成电路,其中所述控制器被配置为在至少一个存储器位置中实施至少一个安全措施。3. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure in at least one memory location. 4.如权利要求2所述的集成电路,其中数据包括神经网络模型的权重数据。4. The integrated circuit of claim 2, wherein the data includes weight data for the neural network model. 5.如权利要求1所述的集成电路,其中所述控制器被配置为实施至少一个安全措施,所述至少一个安全措施包括锁定对不用于输入数据或输出数据操作的所述存储器阵列的一个或多个存储器部分的存取。5. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising locking out one of the memory arrays not used for input data or output data operations or access to multiple memory sections. 6.如权利要求1所述的集成电路,其中所述控制器被配置为实施至少一个安全措施,所述至少一个安全措施包括仅锁定所述存储器阵列的子集。6. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising locking only a subset of the memory array. 7.如权利要求6所述的集成电路,其中所述阵列的所述子集由特定存储器地址指明。7. The integrated circuit of claim 6, wherein the subset of the array is designated by a particular memory address. 8.如权利要求6所述的集成电路,其中所述存储器阵列的所述子集为可配置的。8. The integrated circuit of claim 6, wherein the subset of the memory array is configurable. 9.如权利要求1所述的集成电路,其中所述控制器被配置为实施至少一个安全措施,所述至少一个安全措施包括控制去往或来自所述集成电路的业务。9. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising controlling traffic to or from the integrated circuit. 10.如权利要求1所述的集成电路,其中所述控制器被配置为实施至少一个安全措施,所述至少一个安全措施包括上传可改变数据、代码或固定数据。10. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure including uploading changeable data, code, or fixed data. 11.如权利要求1所述的集成电路,其中所述可改变数据、代码或固定数据的上传在开机处理程序期间发生。11. The integrated circuit of claim 1, wherein the uploading of the changeable data, code, or fixed data occurs during a power-on handler. 12.如权利要求1所述的集成电路,其中所述控制器被配置为实施至少一个安全措施,所述至少一个安全措施包括在开机处理程序期间,上传配置文件,所述配置文件识别待在所述开机处理程序完成后锁定的所述存储器阵列的至少部分的特定存储器地址。12. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising, during a power-on handler, uploading a configuration file, the configuration file identifying the pending A specific memory address of at least a portion of the memory array that is locked upon completion of the power-on handler. 13.如权利要求1所述的集成电路,其中所述控制器经进一步配置以需要复杂密码,以将对与一个或多个存储器地址相关联的所述存储器阵列的存储器部分的存取解除锁定。13. 
The integrated circuit of claim 1, wherein the controller is further configured to require a complex password to unlock access to memory portions of the memory array associated with one or more memory addresses . 14.如权利要求1所述的集成电路,其中在侦测到对至少一个锁定存储器地址的尝试存取后触发所述至少一个安全措施。14. The integrated circuit of claim 1, wherein the at least one security measure is triggered upon detection of an attempted access to at least one locked memory address. 15.如权利要求1所述的集成电路,其中所述控制器被配置为实施至少一个安全措施,所述至少一个安全措施包括:15. The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure comprising: 计算相对于所述存储器阵列的至少部分而计算的校验和、散列、CRC(循环冗余检查)或校验位;及computing a checksum, hash, CRC (Cyclic Redundancy Check) or check bit computed relative to at least a portion of the memory array; and 将所计算的校验和、散列、CRC或校验位与预定值比较。The calculated checksum, hash, CRC or check digit is compared to a predetermined value. 16.如权利要求15所述的集成电路,其中所述控制器被配置为作为所述至少一个安全措施的部分,判定所述所计算的校验和、散列、CRC或校验位是否匹配所述预定值。16. The integrated circuit of claim 15, wherein the controller is configured to determine, as part of the at least one security measure, whether the calculated checksum, hash, CRC or check bit matches the predetermined value. 17.如权利要求1所述的集成电路,其中所述至少一个安全措施包括在至少两个不同存储器部分中复制程序代码。17. The integrated circuit of claim 1, wherein the at least one security measure includes duplicating program code in at least two different memory portions. 18.如权利要求17所述的集成电路,其中所述至少一个安全措施包括判定在所述至少两个不同存储器部分中执行所述程序代码的输出结果是否不同。18. The integrated circuit of claim 17, wherein the at least one security measure includes determining whether output results of executing the program code in the at least two different memory portions are different. 19.如权利要求18所述的集成电路,其中所述输出结果包括中间或最终输出结果。19. The integrated circuit of claim 18, wherein the output results comprise intermediate or final output results. 20.如权利要求17所述的集成电路,其中所述至少两个不同存储器部分包括于所述集成电路内。20. The integrated circuit of claim 17, wherein the at least two different memory portions are included within the integrated circuit. 21.如权利要求1所述的集成电路,其中所述至少一个安全措施包括判定操作图案是否不同于一个或多个预定操作图案。21. The integrated circuit of claim 1, wherein the at least one security measure includes determining whether an operating pattern is different from one or more predetermined operating patterns. 22.如权利要求2所述的集成电路,其中所述一个或多个补救动作包括停止执行操作。22. The integrated circuit of claim 2, wherein the one or more remedial actions include ceasing to perform operations. 23.一种保护集成电路以防篡改的方法,所述方法包括:23. A method of protecting an integrated circuit against tampering, the method comprising: 使用与所述集成电路相关联的控制器相对于所述集成电路的操作实施至少一个安全措施;其中所述集成电路包括:At least one security measure is implemented with respect to operation of the integrated circuit using a controller associated with the integrated circuit; wherein the integrated circuit includes: 基板;substrate; 存储器阵列,其安置于所述基板上,所述存储器阵列包括多个离散存储器组;及a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and 处理阵列,其安置于所述基板上,所述处理阵列包括多个处理器子单元,所述多个处理器子单元中的每一者与所述多个离散存储器组当中的一个或多个离散存储器组相关联。a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits and one or more of the plurality of discrete memory banks Discrete memory banks are associated. 24.如权利要求23所述的方法,其进一步包括在所述至少一个安全措施被触发的情况下采取一个或多个补救动作。24. The method of claim 23, further comprising taking one or more remedial actions if the at least one security measure is triggered. 25.一种集成电路,其包含:25. 
An integrated circuit comprising: 基板;substrate; 存储器阵列,其安置于所述基板上,所述存储器阵列包括多个离散存储器组;a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; 处理阵列,其安置于所述基板上,所述处理阵列包括多个处理器子单元,所述多个处理器子单元中的每一者与所述多个离散存储器组当中的一个或多个离散存储器组相关联;及a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits and one or more of the plurality of discrete memory banks Discrete memory bank associations; and 控制器,其被配置为:Controller, which is configured as: 相对于所述集成电路的操作实施至少一个安全措施;其中所述至少一个安全措施包括在至少两个不同存储器部分中复制程序代码。At least one security measure is implemented with respect to operation of the integrated circuit; wherein the at least one security measure includes duplicating program code in at least two different memory portions. 26.一种集成电路,其包含:26. An integrated circuit comprising: 基板;substrate; 存储器阵列,其安置于所述基板上,所述存储器阵列包括多个离散存储器组;a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; 处理阵列,其安置于所述基板上,所述处理阵列包括多个处理器子单元,所述多个处理器子单元中的每一者与所述多个离散存储器组当中的一个或多个离散存储器组相关联;及a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits and one or more of the plurality of discrete memory banks Discrete memory bank associations; and 控制器,其被配置为相对于所述集成电路的操作实施至少一个安全措施。A controller configured to implement at least one security measure with respect to operation of the integrated circuit. 27.如权利要求26所述的集成电路,其中所述控制器经进一步配置以在所述至少一个安全措施被触发的情况下采取一个或多个补救动作。27. The integrated circuit of claim 26, wherein the controller is further configured to take one or more remedial actions if the at least one security measure is triggered. 28.一种分布式处理器存储器芯片,其包含:28. A distributed processor memory chip comprising: 基板;substrate; 存储器阵列,其安置于所述基板上,所述存储器阵列包括多个离散存储器组;a memory array disposed on the substrate, the memory array comprising a plurality of discrete memory banks; 处理阵列,其安置于所述基板上,所述处理阵列包括多个处理器子单元,所述多个处理器子单元中的每一者与所述多个离散存储器组当中的一个或多个离散存储器组相关联;及a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits and one or more of the plurality of discrete memory banks Discrete memory bank associations; and 第一通信端口,其被配置为在所述分布式处理器存储器芯片与除另一分布式处理器存储器芯片以外的外部实体之间建立通信连接;及a first communication port configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip; and 第二通信端口,其被配置为在所述分布式处理器存储器芯片与第一额外分布式处理器存储器芯片之间建立通信连接。A second communication port configured to establish a communication connection between the distributed processor memory chip and the first additional distributed processor memory chip. 29.如权利要求28所述的分布式处理器存储器芯片,其进一步包含第三通信端口,所述第三通信端口被配置为在所述分布式处理器存储器芯片与第二额外分布式处理器存储器芯片之间建立通信连接。29. The distributed processor memory chip of claim 28, further comprising a third communication port configured to interface between the distributed processor memory chip and a second additional distributed processor A communication connection is established between the memory chips. 30.如权利要求29所述的分布式处理器存储器芯片,其进一步包含控制器,所述控制器被配置为经由所述第一通信端口、所述第二通信端口、所述第三通信端口中的至少一者控制通信。30. The distributed processor memory chip of claim 29, further comprising a controller configured to communicate via the first communication port, the second communication port, the third communication port At least one of them controls the communication. 
31.如权利要求29所述的分布式处理器存储器芯片,其中所述第一通信端口、所述第二通信端口及所述第三通信端口中的每一者与对应总线相关联。31. The distributed processor memory chip of claim 29, wherein each of the first communication port, the second communication port, and the third communication port is associated with a corresponding bus. 32.如权利要求31所述的分布式处理存储器芯片,其中所述对应总线为所述第一通信端口、所述第二通信端口及所述第三通信端口中的每一者所共同的总线。32. The distributed processing memory chip of claim 31, wherein the corresponding bus is a bus common to each of the first communication port, the second communication port, and the third communication port . 33.如权利要求31所述的分布式处理存储器芯片,其中33. The distributed processing memory chip of claim 31, wherein 与所述第一通信端口、所述第二通信端口及所述第三通信端口中的每一者相关联的所述对应总线皆连接至所述多个离散存储器组。The corresponding buses associated with each of the first communication port, the second communication port, and the third communication port are connected to the plurality of discrete memory banks. 34.如权利要求31所述的分布式处理器存储器芯片,其中与所述第一通信端口、所述第二通信端口及所述第三通信端口相关联的至少一个总线为单向的。34. The distributed processor memory chip of claim 31, wherein at least one bus associated with the first communication port, the second communication port, and the third communication port is unidirectional. 35.如权利要求31所述的分布式处理器存储器芯片,其中与所述第一通信端口、所述第二通信端口及所述第三通信端口相关联的至少一个总线为双向的。35. The distributed processor memory chip of claim 31, wherein at least one bus associated with the first communication port, the second communication port, and the third communication port is bidirectional. 36.如权利要求30所述的分布式处理器存储器芯片,其中所述控制器被配置为调度所述分布式处理器存储器芯片与所述第一额外分布式处理器存储器芯片之间的数据传输,使得所述第一额外分布式处理器存储器芯片的接收处理器子单元基于所述数据传输且在接收所述数据传输时的时间段期间执行其相关联的程序代码。36. The distributed processor memory chip of claim 30, wherein the controller is configured to schedule data transfers between the distributed processor memory chip and the first additional distributed processor memory chip , causing the receiving processor sub-unit of the first additional distributed processor memory chip to execute its associated program code based on the data transmission and during a time period when the data transmission is received. 37.如权利要求30所述的分布式处理器存储器芯片,其中所述控制器被配置为将时钟启用信号发送至所述分布式处理器存储器芯片的所述多个处理器子单元中的至少一者,以控制所述多个处理器子单元中的所述至少一者的一个或多个操作方面。37. The distributed processor memory chip of claim 30, wherein the controller is configured to send a clock enable signal to at least one of the plurality of processor subunits of the distributed processor memory chip One, to control one or more operational aspects of the at least one of the plurality of processor sub-units. 38.如权利要求37所述的分布式处理器存储器芯片,其中所述控制器被配置为藉由控制发送至所述多个处理器子单元中的所述至少一者的所述时钟启用信号来控制与所述多个处理器子单元中的所述至少一者相关联的一个或多个通信命令的时序。38. The distributed processor memory chip of claim 37, wherein the controller is configured to control the clock enable signal sent to the at least one of the plurality of processor subunits to control the timing of one or more communication commands associated with the at least one of the plurality of processor subunits. 39.如权利要求30所述的分布式处理器存储器芯片,其中所述控制器被配置为选择性地开始藉由所述分布式处理器存储器芯片上的所述多个处理器子单元中的一个或多个执行程序代码。39. The distributed processor memory chip of claim 30, wherein the controller is configured to selectively initiate processing by one of the plurality of processor subunits on the distributed processor memory chip One or more executable program codes. 40.如权利要求30所述的分布式处理器存储器芯片,其中所述控制器被配置为使用时钟启用信号来控制从所述多个处理器子单元中的一个或多个至所述第二通信端口及所述第三通信端口中的至少一者的数据传输的时序。40. 
The distributed processor memory chip of claim 30, wherein the controller is configured to use a clock enable signal to control communication from one or more of the plurality of processor subunits to the second Timing of data transmission for at least one of the communication port and the third communication port. 41.如权利要求28所述的分布式处理器存储器芯片,其中与所述第一通信端口相关联的通信速度低于与所述第二通信端口相关联的通信速度。41. The distributed processor memory chip of claim 28, wherein a communication speed associated with the first communication port is lower than a communication speed associated with the second communication port. 42.如权利要求30所述的分布式处理器存储器芯片,其中所述控制器被配置为判定所述多个处理器子单元当中的第一处理器子单元是否准备好将数据传送至包括于所述第一额外分布式处理器存储器芯片中的第二处理器子单元,且在判定所述第一处理器子单元准备好将所述数据传送至所述第二处理器子单元之后使用时钟启用信号以起始所述数据从所述第一处理器子单元至所述第二处理器子单元的传送。42. The distributed processor memory chip of claim 30, wherein the controller is configured to determine whether a first processor sub-unit of the plurality of processor sub-units is ready to transfer data to a a second processor subunit in the first additional distributed processor memory chip and uses a clock after determining that the first processor subunit is ready to transfer the data to the second processor subunit An enable signal initiates the transfer of the data from the first processor subunit to the second processor subunit. 43.如权利要求42所述的分布式处理器存储器芯片,其中所述控制器经进一步配置以判定所述第二处理器子单元是否准备好接收所述数据,且在判定所述第二处理器子单元准备好接收所述数据之后使用所述时钟启用信号以起始所述数据从所述第一处理器子单元至所述第二处理器子单元的所述传送。43. The distributed processor memory chip of claim 42, wherein the controller is further configured to determine whether the second processor sub-unit is ready to receive the data, and after determining the second processing The clock enable signal is used to initiate the transfer of the data from the first processor subunit to the second processor subunit after the processor subunit is ready to receive the data. 44.如权利要求42所述的分布式处理器存储器芯片,其中所述控制器经进一步配置以判定所述第二处理器子单元是否准备好接收所述数据且缓冲包括于所述传送中的所述数据,直至所述第一额外分布式处理器存储器芯片的所述第二处理器子单元准备好接收所述数据的判定之后。44. The distributed processor memory chip of claim 42, wherein the controller is further configured to determine whether the second processor sub-unit is ready to receive the data and buffer the data included in the transfer the data until after a determination that the second processor sub-unit of the first additional distributed processor memory chip is ready to receive the data. 45.一种在第一分布式处理器存储器芯片与第二分布式处理器存储器芯片之间传送数据的方法,所述方法包含:45. A method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip, the method comprising: 使用与所述第一分布式处理器存储器芯片及所述第二分布式处理器存储器芯片中的至少一者相关联的控制器判定安置于所述第一分布式处理器存储器芯片上的多个处理器子单元当中的第一处理器子单元是否准备好将数据传送至包括于所述第二分布式处理器存储器芯片中的第二处理器子单元;及using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip to determine a plurality of disposed on the first distributed processor memory chip whether a first processor subunit of the processor subunits is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and 在判定所述第一处理器子单元准备好将所述数据传送至所述第二处理器子单元之后使用由所述控制器控制的时钟启用信号以起始所述数据从所述第一处理器子单元至所述第二处理器子单元的传送。A clock enable signal controlled by the controller is used to initiate processing of the data from the first processor subunit after determining that the first processor subunit is ready to transfer the data to the second processor subunit transfer from the processor subunit to the second processor subunit. 46.如权利要求45所述的方法,其进一步包括:46. 
The method of claim 45, further comprising: 使用所述控制器判定所述第二处理器子单元是否准备好接收所述数据;及determining, using the controller, whether the second processor sub-unit is ready to receive the data; and 在判定所述第二处理器子单元准备好接收所述数据之后使用所述时钟启用信号以起始所述数据从所述第一处理器子单元至所述第二处理器子单元的所述传送。The clock enable signal is used after determining that the second processor subunit is ready to receive the data to initiate the transfer of the data from the first processor subunit to the second processor subunit send. 47.如权利要求45所述的方法,其进一步包括:47. The method of claim 45, further comprising: 使用所述控制器判定所述第二处理器子单元是否准备好接收所述数据,且缓冲包括于所述传送中的所述数据,直至所述第一额外分布式处理器存储器芯片的所述第二处理器子单元准备好接收所述数据的判定之后。determining, using the controller, whether the second processor sub-unit is ready to receive the data, and buffering the data included in the transfer until the first additional distributed processor memory chip's After a determination that the second processor sub-unit is ready to receive the data. 48.一种存储器芯片,其包含:48. A memory chip comprising: 基板;substrate; 存储器阵列,其安置于所述基板上,所述存储器阵列包括多个离散存储器组;及a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and 第一通信端口,其被配置为在所述存储器芯片与除另一存储器芯片以外的外部实体之间建立通信连接;及a first communication port configured to establish a communication connection between the memory chip and an external entity other than another memory chip; and 第二通信端口,其被配置为在所述存储器芯片与第一额外存储器芯片之间建立通信连接。A second communication port configured to establish a communication connection between the memory chip and the first additional memory chip. 49.如权利要求48所述的存储器芯片,其中所述第一通信端口连接至所述存储器芯片内部的主总线或包括于所述存储器芯片中的至少一个处理器子单元中的至少一者。49. The memory chip of claim 48, wherein the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor sub-unit included in the memory chip. 50.如权利要求48所述的存储器芯片,其中所述第二通信端口连接至所述存储器芯片内部的主总线或包括于所述存储器芯片中的至少一个处理器子单元中的至少一者。50. The memory chip of claim 48, wherein the second communication port is connected to at least one of a main bus internal to the memory chip or at least one processor sub-unit included in the memory chip. 51.一种存储器单元,其包含:51. A memory cell comprising: 存储器阵列,其包括多个存储器组;a memory array that includes a plurality of memory banks; 至少一个控制器,其被配置为控制相对于所述多个存储器组的读取操作的至少一个方面;at least one controller configured to control at least one aspect of read operations relative to the plurality of memory banks; 至少一个零值侦测逻辑单元,其被配置为侦测与储存于所述多个存储器组的特定地址中的数据相关联的多位零值;且at least one zero value detection logic unit configured to detect multi-bit zero values associated with data stored in particular addresses of the plurality of memory banks; and 其中所述至少一个控制器被配置为响应于藉由所述至少一个零值侦测逻辑单元进行的零值侦测而将零值指示符传回至一个或多个电路。wherein the at least one controller is configured to communicate a zero value indicator back to one or more circuits in response to a zero value detection by the at least one zero value detection logic unit. 52.如权利要求51所述的存储器单元,其中传回有所述零值指示符的所述一个或多个电路在所述存储器单元外部。52. The memory cell of claim 51, wherein the one or more circuits that return the zero-value indicator are external to the memory cell. 53.如权利要求51所述的存储器单元,其中传回有所述零值指示符的所述一个或多个电路在所述存储器单元内部。53. The memory cell of claim 51, wherein the one or more circuits that return the zero-value indicator are internal to the memory cell. 54.如权利要求51所述的存储器单元,其中所述存储器单元进一步包括至少一个读取停用元件,所述至少一个读取停用元件被配置为在所述至少一个零值侦测逻辑单元侦测到与所述特定地址相关联的零值时中断与所述特定地址相关联的读取命令。54. 
The memory cell of claim 51, wherein the memory cell further comprises at least one read disable element configured to detect logic cells at the at least one zero value A read command associated with the particular address is interrupted upon detection of a zero value associated with the particular address. 55.如权利要求51所述的存储器单元,其中所述至少一个控制器被配置为将所述零值指示符发送至所述一个或多个电路而非发送储存于所述特定地址中的零值数据。55. The memory cell of claim 51, wherein the at least one controller is configured to send the zero value indicator to the one or more circuits rather than sending a zero stored in the particular address value data. 56.如权利要求51所述的存储器单元,其中所述零值指示符的大小小于零数据的大小。56. The memory cell of claim 51, wherein the size of the zero value indicator is less than the size of zero data. 57.如权利要求51所述的存储器单元,其中由包含以下操作的第一处理程序所消耗的能量小于由将零值数据发送至所述一个或多个电路所消耗的能量:(a)侦测所述零值;(b)产生所述零值指示符;及(c)将所述零值指示符发送至所述一个或多个电路。57. The memory cell of claim 51, wherein the energy consumed by the first handler comprising the following operations is less than the energy consumed by sending zero-valued data to the one or more circuits: (a) detecting measuring the zero value; (b) generating the zero value indicator; and (c) sending the zero value indicator to the one or more circuits. 58.如权利要求57所述的存储器单元,其中由所述第一处理程序所消耗的所述能量小于由将所述零值数据发送至所述一个或多个电路所消耗的所述能量的一半。58. The memory cell of claim 57, wherein the energy consumed by the first handler is less than the energy consumed by sending the zero-valued data to the one or more circuits half. 59.如权利要求51所述的存储器单元,其中所述存储器单元进一步包括至少一个感测放大器,所述至少一个感测放大器被配置为在所述至少一个零值侦测单元进行零值侦测之后阻止启动所述多个存储器组中的至少一者。59. The memory cell of claim 51, wherein the memory cell further comprises at least one sense amplifier configured to perform zero detection at the at least one zero detection unit At least one of the plurality of memory banks is then blocked from starting. 60.如权利要求59所述的存储器单元,其中所述至少一个感测放大器包含多个晶体管,所述多个晶体管被配置为感测来自所述多个存储器组的低功率信号,且所述至少一个感测放大器将小电压摆动放大至较高电压电平使得储存于所述多个存储器组中的数据能够由所述至少一个控制器解译。60. The memory cell of claim 59, wherein the at least one sense amplifier comprises a plurality of transistors configured to sense low power signals from the plurality of memory banks, and the At least one sense amplifier amplifies small voltage swings to higher voltage levels so that data stored in the plurality of memory banks can be interpreted by the at least one controller. 61.如权利要求51所述的存储器单元,其中所述多个存储器组中的每一者经进一步组织成子组,所述至少一个控制器包括子组控制器,且其中所述至少一个零值侦测逻辑单元包括与所述子组相关联的零值侦测逻辑。61. The memory cell of claim 51, wherein each of the plurality of memory banks is further organized into subgroups, the at least one controller comprises a subgroup controller, and wherein the at least one zero value The detection logic unit includes zero value detection logic associated with the subset. 62.如权利要求61所述的存储器单元,其中所述存储器单元进一步包括至少一个读取停用元件,所述至少一个读取停用元件包括与所述子组中的每一者相关联的感测放大器。62. The memory cell of claim 61, wherein the memory cell further comprises at least one read disable element comprising a read disable element associated with each of the subsets sense amplifier. 63.如权利要求51所述的存储器单元,其进一步包括在空间上分布于所述存储器单元内的多个处理器子单元,其中所述多个处理器子单元中的每一者与所述多个存储器组中的专用的至少一者相关联,且其中所述多个处理器子单元中的每一者被配置为对储存于对应存储器组中的数据进行存取及操作。63. 
The memory unit of claim 51, further comprising a plurality of processor sub-units spatially distributed within the memory unit, wherein each of the plurality of processor sub-units is associated with the A dedicated at least one of the plurality of memory banks is associated, and wherein each of the plurality of processor sub-units is configured to access and operate on data stored in the corresponding memory bank. 64.如权利要求63所述的存储器单元,其中所述一个或多个电路包括所述处理器子单元中的一个或多个。64. The memory unit of claim 63, wherein the one or more circuits comprise one or more of the processor subunits. 65.如权利要求63所述的存储器单元,其中所述多个处理器子单元中的每一者藉由一个或多个总线连接至所述多个处理器子单元当中的两个或多于两个其他处理器子单元。65. The memory unit of claim 63, wherein each of the plurality of processor sub-units is connected to two or more of the plurality of processor sub-units by one or more buses Two other processor subunits. 66.如权利要求51所述的存储器单元,其进一步包含多个总线。66. The memory cell of claim 51, further comprising a plurality of buses. 67.如权利要求66所述的存储器单元,其中所述多个总线被配置为在所述多个存储器组之间传送数据。67. The memory unit of claim 66, wherein the plurality of buses are configured to transfer data between the plurality of memory banks. 68.如权利要求67所述的存储器单元,其中所述多个总线中的至少一者被配置为将所述零值指示符传送至所述一个或多个电路。68. The memory cell of claim 67, wherein at least one of the plurality of buses is configured to communicate the zero-value indicator to the one or more circuits. 69.一种用于侦测多个存储器组的特定地址中的零值的方法,其包含:69. A method for detecting zero values in specific addresses of a plurality of memory banks, comprising: 从存储器单元外部的电路接收读取储存于多个存储器组的地址中的数据的请求;receiving, from circuitry external to the memory unit, a request to read data stored at addresses of a plurality of memory banks; 响应于所接收请求而启动零值侦测逻辑单元以藉由控制器侦测所接收地址中的零值;及in response to the received request, enabling a zero value detection logic unit to detect, by the controller, a zero value in the received address; and 响应于藉由所述零值侦测逻辑单元进行的零值侦测而藉由所述控制器将零值指示符传输至所述电路。A zero value indicator is communicated to the circuit by the controller in response to the zero value detection by the zero value detection logic unit. 70.如权利要求69所述的方法,其进一步包含藉由所述控制器配置读取停用元件以在所述零值侦测逻辑单元侦测到与所请求地址相关联的零值时中断与所述所请求地址相关联的读取命令。70. The method of claim 69, further comprising configuring, by the controller, a read disable element to interrupt when the zero value detection logic unit detects a zero value associated with a requested address a read command associated with the requested address. 71.如权利要求69所述的方法,其进一步包含藉由所述控制器配置感测放大器以在所述零值侦测单元侦测到零值时阻止所述多个存储器组中的至少一者的启动。71. The method of claim 69, further comprising configuring, by the controller, a sense amplifier to block at least one of the plurality of memory banks when the zero value detection unit detects a zero value starter. 72.一种非暂时性计算机可读介质,其储存指令集,所述指令集能够由存储器单元的控制器执行以使所述存储器单元侦测多个存储器组的特定地址中的零值,方法包含:72. 
A non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value in a particular address of a plurality of memory banks, method Include: 从存储器单元外部的电路接收读取储存于多个存储器组的地址中的数据的请求;receiving, from circuitry external to the memory unit, a request to read data stored at addresses of a plurality of memory banks; 响应于所接收请求而启动零值侦测逻辑单元以藉由控制器侦测所接收地址中的零值;及in response to the received request, enabling a zero value detection logic unit to detect, by the controller, a zero value in the received address; and 响应于藉由所述零值侦测逻辑单元进行的零值侦测而藉由所述控制器将零值指示符传输至所述电路。A zero value indicator is communicated to the circuit by the controller in response to the zero value detection by the zero value detection logic unit. 73.如权利要求72所述的非暂时性计算机可读介质,其中所述方法进一步包含藉由所述控制器配置读取停用元件以在所述零值侦测逻辑单元侦测到与所请求地址相关联的零值时中断与所述所请求地址相关联的读取命令。73. The non-transitory computer-readable medium of claim 72, wherein the method further comprises configuring, by the controller, a read disable element to detect at the zero value detection logic unit and all A read command associated with the requested address is interrupted when the zero value associated with the requested address is interrupted. 74.如权利要求72所述的非暂时性计算机可读介质,其中所述方法进一步包含藉由所述控制器配置感测放大器以在所述零值侦测单元侦测到零值时阻止所述多个存储器组中的至少一者的启动。74. The non-transitory computer-readable medium of claim 72, wherein the method further comprises configuring, by the controller, a sense amplifier to prevent all of the zero values when the zero value detection unit detects a zero value activation of at least one of the plurality of memory banks. 75.一种集成电路,其包含:75. An integrated circuit comprising: 存储器单元,其包含多个存储器组、被配置为控制相对于所述多个存储器组的读取操作的至少一个方面的至少一个控制器及被配置为侦测与储存于所述多个存储器组的特定地址中的数据相关联的多位零值的至少一个零值侦测逻辑单元;a memory unit comprising a plurality of memory banks, at least one controller configured to control at least one aspect of a read operation relative to the plurality of memory banks, and configured to detect and store in the plurality of memory banks at least one zero value detection logic unit of the multi-bit zero value associated with the data in the specific address; 处理单元,其被配置为将读取请求发送所述存储器单元至所述存储器以用于从所述存储器单元读取数据;且a processing unit configured to send a read request to the memory unit to the memory for reading data from the memory unit; and 其中所述至少一个控制器及所述至少一个零值侦测逻辑被配置为响应于由所述至少一个零值侦测逻辑进行的零值侦测而将零值指示符传回至一个或多个电路。wherein the at least one controller and the at least one zero-value detection logic are configured to communicate a zero-value indicator back to one or more of the zero-value indicators in response to a zero-value detection by the at least one zero-value detection logic a circuit. 76.一种存储器单元,其包含:76. A memory cell comprising: 存储器阵列,其包括多个存储器组;a memory array that includes a plurality of memory banks; 至少一个控制器,其被配置为控制相对于所述多个存储器组的读取操作的至少一个方面;at least one controller configured to control at least one aspect of read operations relative to the plurality of memory banks; 至少一个侦测逻辑单元,其被配置为侦测与储存于所述多个存储器组的特定地址中的数据相关联的预定多位值;且at least one detection logic unit configured to detect predetermined multi-bit values associated with data stored in particular addresses of the plurality of memory banks; and 其中所述至少一个控制器被配置为响应于所述至少一个侦测逻辑对所述预定多位值进行的侦测而将值指示符传回至一个或多个电路。wherein the at least one controller is configured to communicate a value indicator back to one or more circuits in response to detection of the predetermined multi-bit value by the at least one detection logic. 77.如权利要求76所述的存储器单元,其中所述预定多位值能够由用户选择。77. 
The memory cell of claim 76, wherein the predetermined multi-bit value is selectable by a user. 78.一种存储器单元,其包含:78. A memory cell comprising: 存储器阵列,其包括多个存储器组;a memory array that includes a plurality of memory banks; 至少一个控制器,其被配置为控制相对于所述多个存储器组的写入操作的至少一个方面;at least one controller configured to control at least one aspect of write operations with respect to the plurality of memory banks; 至少一个侦测逻辑单元,其被配置为侦测与待写入至所述多个存储器组的特定地址的数据相关联的预定多位值;且at least one detection logic unit configured to detect a predetermined multi-bit value associated with data to be written to a particular address of the plurality of memory banks; and 其中所述至少一个控制器被配置为响应于所述至少一个侦测逻辑对所述预定多位值进行的侦测而将值指示符提供至一个或多个电路。wherein the at least one controller is configured to provide a value indicator to one or more circuits in response to detection of the predetermined multi-bit value by the at least one detection logic. 79.一种分布式处理器存储器芯片,其包含:79. A distributed processor memory chip comprising: 基板;substrate; 存储器阵列,其包括安置于所述基板上的多个存储器组;a memory array including a plurality of memory banks disposed on the substrate; 多个处理器子单元,其安置于所述基板上;a plurality of processor subunits disposed on the substrate; 至少一个控制器,其被配置为控制相对于所述多个存储器组的读取操作的至少一个方面;at least one controller configured to control at least one aspect of read operations relative to the plurality of memory banks; 至少一个侦测逻辑单元,其被配置为侦测与储存于所述多个存储器组的特定地址中的数据相关联的预定多位值;且at least one detection logic unit configured to detect predetermined multi-bit values associated with data stored in particular addresses of the plurality of memory banks; and 其中所述至少一个控制器被配置为响应于所述至少一个侦测逻辑对所述预定多位值进行的侦测而将值指示符传回至所述多个处理器子单元中的一个或多个。wherein the at least one controller is configured to return a value indicator to one of the plurality of processor sub-units in response to a detection of the predetermined multi-bit value by the at least one detection logic or multiple. 80.一种存储器单元,其包含:80. A memory cell comprising: 一个或多个存储器组;one or more memory banks; 组控制器;及group controller; and 地址产生器;address generator; 其中所述地址产生器被配置为:where the address generator is configured as: 将所述一个或多个存储器组中的相关联存储器组中待存取的当前行的当前地址提供至所述组控制器;providing a current address of a current row to be accessed in an associated one of the one or more memory banks to the bank controller; 判定所述相关联存储器组中待存取的下一行的预测地址;及determining the predicted address of the next row to be accessed in the associated memory bank; and 在相对于与所述当前地址相关联的所述当前行的操作完成之前将所述预测地址提供至所述组控制器。The predicted address is provided to the bank controller prior to completion of operations relative to the current row associated with the current address. 81.如权利要求80所述的存储器单元,其中相对于与所述当前地址相关联的所述当前行的所述操作为读取操作或写入操作。81. The memory cell of claim 80, wherein the operation relative to the current row associated with the current address is a read operation or a write operation. 82.如权利要求80所述的存储器单元,其中所述当前行及所述下一行在同一存储器组中。82. The memory cell of claim 80, wherein the current row and the next row are in the same memory bank. 83.如权利要求82所述的存储器单元,其中所述同一存储器组允许在正存取所述当前行时存取所述下一行。83. The memory cell of claim 82, wherein the same memory bank allows access to the next row while the current row is being accessed. 84.如权利要求80所述的存储器单元,其中所述当前行及所述下一行在不同存储器组中。84. The memory cell of claim 80, wherein the current row and the next row are in different memory banks. 85.如权利要求80所述的存储器单元,分布式处理器,其中所述分布式处理器包含处理阵列的多个处理器子单元,所述处理阵列的多个处理器子单元在空间上分布于存储器阵列的多个离散存储器组当中。85. 
The memory unit of claim 80, a distributed processor, wherein the distributed processor comprises a plurality of processor subunits of a processing array, the plurality of processor subunits of the processing array being spatially distributed in multiple discrete memory banks of the memory array. 86.如权利要求80所述的存储器单元,其中所述组控制器被配置为存取所述当前行且在相对于所述当前行的所述操作的完成之前启动所述下一行。86. The memory cell of claim 80, wherein the bank controller is configured to access the current row and start the next row prior to completion of the operation relative to the current row. 87.如权利要求80所述的存储器单元,其中所述一个或多个存储器组中的每一者包括至少第一子组及第二子组,且其中与所述一个或多个存储器组中的每一者相关联的组控制器包括与所述第一子组相关联的第一子组控制器及与所述第二子组相关联的第二组控制器。87. The memory cell of claim 80, wherein each of the one or more memory banks includes at least a first subset and a second subset, and wherein the one or more memory banks are the same as in the one or more memory banks The group controllers associated with each of the include a first subgroup controller associated with the first subgroup and a second group controller associated with the second subgroup. 88.如权利要求87所述的存储器单元,其中第一子组控制器被配置为使得能够存取包括于所述第一子组的当前行中的数据,而第二子组控制器启动所述第二子组中的下一行。88. The memory cell of claim 87, wherein a first subset controller is configured to enable access to data included in a current row of the first subset, and a second subset controller enables all . the next row in the second subgroup. 89.如权利要求88所述的存储器单元,其中所述第二子组的经启动的下一行与所述第一子组中正被存取数据的所述当前行隔开至少两行。89. The memory cell of claim 88, wherein an activated next row of the second subset is separated by at least two rows from the current row of data being accessed in the first subset. 90.如权利要求87所述的存储器单元,其中所述第二子组控制器被配置为使得存取包括于所述第二子组的当前行中的数据,而所述第一子组控制器启动所述第一子组中的下一行。90. The memory cell of claim 87, wherein the second subset controller is configured such that data included in a current row of the second subset is accessed while the first subset controls The processor starts the next row in the first subgroup. 91.如权利要求90所述的存储器单元,其中所述第一子组的经启动的下一行与所述第二子组中正被存取数据的所述当前行隔开至少两行。91. The memory cell of claim 90, wherein an activated next row of the first subset is separated by at least two rows from the current row of data being accessed in the second subset. 92.如权利要求80所述的存储器单元,其中使用经训练神经网络判定所述预测地址。92. The memory cell of claim 80, wherein the predicted address is determined using a trained neural network. 93.如权利要求80所述的存储器单元,其中基于所判定的排存取图案而判定所述预测地址。93. The memory cell of claim 80, wherein the predicted address is determined based on the determined bank access pattern. 94.如权利要求80所述的存储器单元,其中所述地址产生器包括被配置为产生所述当前地址的第一地址产生器及被配置为产生所述预测地址的第二地址产生器。94. The memory cell of claim 80, wherein the address generator comprises a first address generator configured to generate the current address and a second address generator configured to generate the predicted address. 95.如权利要求94所述的存储器单元,其中所述第二地址产生器被配置为在当前地址产生器已产生所述当前地址之后的预定时间段内计算所述预测地址。95. The memory cell of claim 94, wherein the second address generator is configured to calculate the predicted address within a predetermined period of time after the current address generator has generated the current address. 96.如权利要求95所述的存储器单元,其中所述预定时间段为可调整的。96. The memory cell of claim 95, wherein the predetermined period of time is adjustable. 97.如权利要求96所述的存储器单元,其中基于与所述存储器单元相关联的至少一个操作参数的值而调整所述预定时间段。97. The memory cell of claim 96, wherein the predetermined period of time is adjusted based on a value of at least one operating parameter associated with the memory cell. 98.如权利要求97所述的存储器单元,其中所述至少一个操作参数包括所述存储器单元的温度。98. 
The memory cell of claim 97, wherein the at least one operating parameter comprises a temperature of the memory cell. 99.如权利要求80所述的存储器单元,其中所述地址产生器经进一步配置以产生与所述预测地址相关联的置信度,且在所述置信度降至低于预定阈值的情况下使所述组控制器放弃存取所述预测地址处的所述下一行。99. The memory cell of claim 80, wherein the address generator is further configured to generate a confidence level associated with the predicted address, and cause the confidence level to fall below a predetermined threshold if the confidence level falls below a predetermined threshold. The group controller relinquishes access to the next row at the predicted address. 100.如权利要求80所述的存储器单元,其中所述预测地址由对延迟产生的所述地址进行取样的一系列触发器产生。100. The memory cell of claim 80, wherein the predicted address is generated by a series of flip-flops that sample the delayed generated address. 101.如权利要求100所述的存储器单元,其中所述延迟可经由多任务器配置,所述多任务器在储存经取样地址的触发器之间进行选择。101. The memory cell of claim 100, wherein the delay is configurable via a multiplexer that selects between flip-flops that store sampled addresses. 102.如权利要求80所述的存储器单元,其中所述组控制器被配置为在所述存储器单元的重设之后的预定时段期间忽略从所述地址产生器接收到的预测地址。102. The memory cell of claim 80, wherein the bank controller is configured to ignore predicted addresses received from the address generator during a predetermined period of time following a reset of the memory cell. 103.如权利要求80所述的存储器单元,其中所述地址产生器被配置为在侦测到相对于所述相关联存储器组的行存取中的随机图案之后放弃将所述预测地址提供至所述组控制器。103. The memory cell of claim 80, wherein the address generator is configured to forego providing the predicted address to a random pattern in row access relative to the associated memory bank after detecting a random pattern. the group controller. 104.一种存储器单元,其包含:104. A memory cell comprising: 一个或多个存储器组,其中所述一个或多个存储器组中的每一者包括:one or more memory banks, wherein each of the one or more memory banks comprises: 多个行;multiple lines; 第一行控制器,其被配置为控制所述多个行的第一子集;a first row controller configured to control a first subset of the plurality of rows; 第二行控制器,其被配置为控制所述多个行的第二子集;a second row controller configured to control a second subset of the plurality of rows; 单个数据输入端,其用以接收待储存于所述多个行中的数据;及a single data input for receiving data to be stored in the plurality of rows; and 单个数据输出端,其用以提供从所述多个行撷取的数据。a single data output to provide data retrieved from the plurality of rows. 105.如权利要求104所述的存储器单元,其中所述存储器单元被配置为在预定时间接收第一地址以用于处理及接收第二地址以启动及存取。105. The memory unit of claim 104, wherein the memory unit is configured to receive a first address at predetermined times for processing and to receive a second address for activation and access. 106.如权利要求104所述的存储器单元,其中所述多个行的所述第一子集由偶数编号行构成。106. The memory cell of claim 104, wherein the first subset of the plurality of rows consists of even-numbered rows. 107.如权利要求106所述的存储器单元,其中偶数编号行位于所述一个或多个存储器组的一半中。107. The memory cell of claim 106, wherein even numbered rows are located in half of the one or more memory banks. 108.如权利要求106所述的存储器单元,其中奇数编号行位于所述一个或多个存储器组的一半中。108. The memory cell of claim 106, wherein odd-numbered rows are located in half of the one or more memory banks. 109.如权利要求104所述的存储器单元,其中所述多个行的所述第二子集由奇数编号行构成。109. The memory cell of claim 104, wherein the second subset of the plurality of rows consists of odd-numbered rows. 110.如权利要求104所述的存储器单元,其中所述多个行的所述第一子集含于一存储器组的第一子组中,所述第一子组邻近于含有所述多个行的所述第二子集的所述存储器组的第二子组。110. The memory cell of claim 104, wherein the first subset of the plurality of rows is contained in a first subset of a memory bank adjacent to the first subset containing the plurality of A second subset of the memory banks of the second subset of rows. 
104. A memory unit comprising:
one or more memory banks, wherein each of the one or more memory banks includes:
a plurality of rows;
a first row controller configured to control a first subset of the plurality of rows;
a second row controller configured to control a second subset of the plurality of rows;
a single data input for receiving data to be stored in the plurality of rows; and
a single data output for providing data retrieved from the plurality of rows.
105. The memory unit of claim 104, wherein the memory unit is configured to receive, at a predetermined time, a first address for processing and a second address for activation and access.
106. The memory unit of claim 104, wherein the first subset of the plurality of rows consists of even-numbered rows.
107. The memory unit of claim 106, wherein the even-numbered rows are located in one half of the one or more memory banks.
108. The memory unit of claim 106, wherein odd-numbered rows are located in one half of the one or more memory banks.
109. The memory unit of claim 104, wherein the second subset of the plurality of rows consists of odd-numbered rows.
110. The memory unit of claim 104, wherein the first subset of the plurality of rows is contained in a first sub-bank of a memory bank, the first sub-bank being adjacent to a second sub-bank of the memory bank that contains the second subset of the plurality of rows.
111. The memory unit of claim 104, wherein the first row controller is configured to cause access to data included in a row of the first subset of the plurality of rows while the second row controller activates a row of the second subset of the plurality of rows.
112. The memory unit of claim 111, wherein the activated row of the second subset of the plurality of rows is separated by at least two rows from the row of the first subset of the plurality of rows whose data is being accessed.
113. The memory unit of claim 104, wherein the second row controller is configured to cause access to data included in a row of the second subset of the plurality of rows while the first row controller activates a row of the first subset of the plurality of rows.
114. The memory unit of claim 113, wherein the activated row of the first subset of the plurality of rows is separated by at least two rows from the row of the second subset of the plurality of rows whose data is being accessed.
115. The memory unit of claim 104, wherein each of the one or more memory banks includes a column input for receiving a column identifier indicative of a portion of a row to be accessed.
116. The memory unit of claim 104, wherein a row of additional redundant pads is placed between each two rows of pads to create a distance that allows activation.
117. The memory unit of claim 104, wherein rows that are close to each other may not be activated at the same time.
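A minimal sketch of the dual row-controller arrangement of claims 104-114 follows, assuming an even/odd split of the rows and a minimum spacing of two rows between an activated row and the row being accessed; the timing model and the names are illustrative assumptions.

    # Illustrative sketch of interleaved access/activation across two row controllers.
    MIN_SPACING = 2
    
    class RowController:
        def __init__(self, rows):
            self.rows = set(rows)
            self.active_row = None
    
        def activate(self, row):
            assert row in self.rows, "row not managed by this controller"
            self.active_row = row
    
        def read(self, row, column):
            assert row == self.active_row, "row must be activated before access"
            return f"data@{row}:{column}"
    
    class DualControllerBank:
        def __init__(self, num_rows):
            self.even = RowController(range(0, num_rows, 2))
            self.odd = RowController(range(1, num_rows, 2))
    
        def _controller_for(self, row):
            return self.even if row % 2 == 0 else self.odd
    
        def access_and_prepare(self, current_row, column, next_row):
            """Read from current_row while the other controller activates next_row."""
            cur_ctrl = self._controller_for(current_row)
            nxt_ctrl = self._controller_for(next_row)
            assert cur_ctrl is not nxt_ctrl, "next row must belong to the other subset"
            assert abs(next_row - current_row) >= MIN_SPACING, "rows too close to overlap"
            nxt_ctrl.activate(next_row)          # hides activation latency behind the access
            return cur_ctrl.read(current_row, column)
    
    bank = DualControllerBank(num_rows=1024)
    bank.even.activate(10)
    print(bank.access_and_prepare(current_row=10, column=3, next_row=13))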
118. A distributed processor on a memory chip, comprising:
a substrate;
a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks;
a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks; and
at least one memory pad disposed on the substrate, wherein the at least one memory pad is configured to serve as at least one register of a register file for one or more of the plurality of processor subunits.
119. The memory chip of claim 118, wherein the at least one memory pad is included in at least one of the plurality of processor subunits of the processing array.
120. The memory chip of claim 118, wherein the register file is configured as a data register file.
121. The memory chip of claim 118, wherein the register file is configured as an address register file.
122. The memory chip of claim 118, wherein the at least one memory pad is configured to provide at least one register of a register file for one or more of the plurality of processor subunits to store data to be accessed by one or more of the plurality of processor subunits.
123. The memory chip of claim 118, wherein the at least one memory pad is configured to provide at least one register of a register file for one or more of the plurality of processor subunits, wherein the at least one register of the register file is configured to store coefficients used by the plurality of processor subunits during execution of convolution accelerator operations by the plurality of processor subunits.
124. The memory chip of claim 118, wherein the at least one memory pad is a DRAM memory pad.
125. The memory chip of claim 118, wherein the at least one memory pad is configured to communicate via unidirectional access.
126. The memory chip of claim 118, wherein the at least one memory pad allows bidirectional access.
127. The memory chip of claim 1, further comprising at least one redundant memory pad disposed on the substrate, wherein the at least one redundant memory pad is configured to provide at least one redundant register for one or more of the plurality of processor subunits.
128. The memory chip of claim 118, further comprising at least one memory pad disposed on the substrate, wherein the at least one memory pad contains at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits.
129. The memory chip of claim 118, further comprising:
a first plurality of buses, each of the first plurality of buses connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank; and
a second plurality of buses, each of the second plurality of buses connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
130. The memory chip of claim 118, wherein at least one of the processor subunits includes a counter configured to count down from a predefined number, and after the counter reaches a value of zero, the at least one of the processor subunits is configured to stop a current task and trigger a memory refresh operation.
131. The memory chip of claim 118, wherein at least one of the processor subunits includes a mechanism for stopping a current task and triggering, at particular times, a memory refresh operation to refresh the memory pad.
132. The memory chip of claim 118, wherein the register file is configured to serve as a cache.
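A small sketch of a processor subunit whose register file lives in a DRAM memory pad and therefore needs periodic refresh, in the spirit of claims 118 and 130-132, is given below; the count-down policy, instruction format, and all names are illustrative assumptions.

    # Sketch of a pad-backed register file with a refresh counter (assumptions only).
    class DramPadRegisterFile:
        def __init__(self, num_registers):
            self.cells = [0] * num_registers   # stands in for the DRAM memory pad
    
        def read(self, index):
            return self.cells[index]
    
        def write(self, index, value):
            self.cells[index] = value
    
        def refresh(self):
            # A real pad would rewrite each row to restore charge; modeled as a no-op.
            pass
    
    class ProcessorSubunit:
        REFRESH_INTERVAL = 64                  # predefined count-down value
    
        def __init__(self, dedicated_bank, register_file):
            self.bank = dedicated_bank         # corresponding dedicated memory bank
            self.regs = register_file
            self.refresh_counter = self.REFRESH_INTERVAL
    
        def step(self, instruction):
            self.refresh_counter -= 1
            if self.refresh_counter == 0:      # stop the task and refresh the pad
                self.regs.refresh()
                self.refresh_counter = self.REFRESH_INTERVAL
                return
            op, dst, src = instruction
            if op == "load":                   # memory bank -> register
                self.regs.write(dst, self.bank[src])
            elif op == "add":                  # register += register
                self.regs.write(dst, self.regs.read(dst) + self.regs.read(src))
    
    subunit = ProcessorSubunit(dedicated_bank={0: 7, 1: 5}, register_file=DramPadRegisterFile(8))
    subunit.step(("load", 0, 0))
    subunit.step(("load", 1, 1))
    subunit.step(("add", 0, 1))
    print(subunit.regs.read(0))                # 12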
133. A method of executing at least one instruction in a distributed processor memory chip, the method comprising:
retrieving one or more data values from a memory array of the distributed processor memory chip;
storing the one or more data values in a register formed in a memory pad of the distributed processor memory chip; and
accessing the one or more data values stored in the register according to at least one instruction executed by a processor element;
wherein the memory array includes a plurality of discrete memory banks disposed on a substrate;
wherein the processor element is a processor subunit among a plurality of processor subunits included in a processing array disposed on the substrate, wherein each of the processor subunits is associated with a corresponding, dedicated one of the plurality of discrete memory banks; and
wherein the register is provided by a memory pad disposed on the substrate.
134. The method of claim 133, wherein the processor element is configured to serve as an accelerator, and the method further comprises:
accessing first data stored in the register;
accessing second data from the memory array; and
performing an operation on the first data and the second data.
135. The method of claim 133, wherein at least one memory pad includes a plurality of word lines and bit lines, and the method further comprises:
determining a timing for loading the word lines and the bit lines, the timing being determined by a size of the memory pad.
136. The method of claim 133, further comprising:
periodically refreshing the register.
137. The method of claim 12, wherein the memory pad comprises a DRAM memory pad.
138. The method of claim 133, wherein the memory pad is included in at least one of the plurality of discrete memory banks of the memory array.
139. An apparatus comprising:
a substrate;
a processing unit disposed on the substrate; and
a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and
wherein the processing unit includes a memory pad configured to serve as a cache for the processing unit.
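As a rough illustration of the flow in claims 133-134 (retrieve from the memory array, stage in a pad-backed register, then combine with further memory operands), a short sketch follows; the multiply-accumulate choice of operation and the function names are assumptions, since the claims do not fix a particular operation.

    # Illustrative only: the accelerator operation is assumed to be a multiply-accumulate.
    def execute_mac(memory_array, coefficient_addr, data_addrs):
        pad_registers = {}                     # stands in for the pad-backed register file
    
        # Retrieve a value from the memory array and stage it in the register
        # formed in the memory pad (first two steps of claim 133).
        pad_registers[0] = memory_array[coefficient_addr]
    
        # Access the register per the executed instructions and combine it with
        # second data fetched from the memory array (claims 133-134).
        accumulator = 0
        for addr in data_addrs:
            accumulator += pad_registers[0] * memory_array[addr]
        return accumulator
    
    memory = {0: 3, 1: 10, 2: 20, 3: 30}
    print(execute_mac(memory, coefficient_addr=0, data_addrs=[1, 2, 3]))   # 3*(10+20+30) = 180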
140. A method for distributed processing of at least one information stream, the method comprising:
receiving, by one or more memory processing integrated circuits, the at least one information stream via a first communication channel, wherein each memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;
buffering, by the one or more memory processing integrated circuits, the at least one information stream;
performing, by the one or more memory processing integrated circuits, a first processing operation on the at least one information stream to provide a first processing result;
sending the first processing result to a processing integrated circuit; and
performing, by the one or more memory processing integrated circuits, a second processing operation on the first processing result to provide a second processing result;
wherein a size of logic cells of the one or more memory processing integrated circuits is smaller than a size of logic cells of the processing integrated circuit.
141. The method of claim 140, wherein each of the plurality of memory units is coupled to at least one of the plurality of processor subunits.
142. The method of claim 140, wherein a total size of information units of the at least one information stream received during a certain duration exceeds a total size of first processing results output during the certain duration.
143. The method of claim 140, wherein a total size of the at least one information stream is smaller than a total size of the first processing result.
144. The method of claim 140, wherein the memory-class manufacturing process is a DRAM manufacturing process.
145. The method of claim 140, wherein the one or more memory processing integrated circuits are manufactured by a memory-class manufacturing process, and wherein the processing integrated circuit is manufactured by a logic-class manufacturing process.
146. The method of claim 140, wherein a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the processing integrated circuit.
147. The method of claim 140, wherein a critical dimension of a logic cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the processing integrated circuit.
148. The method of claim 140, wherein a critical dimension of a memory cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the processing integrated circuit.
149. The method of claim 140, comprising requesting, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operation.
150. The method of claim 140, comprising instructing, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operation.
151. The method of claim 140, comprising configuring, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operation.
152. The method of claim 140, comprising performing the first processing operation by the one or more memory processing integrated circuits without involvement of the processing integrated circuit.
153. The method of claim 140, wherein a computational complexity of the first processing operation is lower than a computational complexity of the second processing operation.
154. The method of claim 140, wherein a total throughput of the first processing operation exceeds a total throughput of the second processing operation.
155. The method of claim 140, wherein the at least one information stream includes one or more preprocessed information streams.
156. The method of claim 157, wherein the one or more preprocessed information streams are data extracted from network transport units.
157. The method of claim 140, wherein one portion of the first processing operation is performed by one of the plurality of processor subunits and another portion of the first processing operation is performed by another of the plurality of processor subunits.
158. The method of claim 140, wherein the first processing operation and the second processing operation include cellular network processing operations.
159. The method of claim 140, wherein the first processing operation and the second processing operation include database processing operations.
160. The method of claim 140, wherein the first processing operation and the second processing operation include database analytics processing operations.
161. The method of claim 140, wherein the first processing operation and the second processing operation include artificial intelligence processing operations.
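A compact sketch of the two-stage split in claims 140-154 follows: a simple, high-throughput first operation that shrinks the stream near memory, and a more complex second operation applied to the reduced result. The concrete operations (filtering and averaging) and the chunked buffering model are assumptions used only to make the flow concrete.

    # Sketch of the stream split of claims 140-154 (operations are assumed).
    def first_processing(stream_chunk, threshold=100):
        """Low-complexity, data-reducing step run inside the memory processing ICs."""
        return [record for record in stream_chunk if record["value"] > threshold]
    
    def second_processing(first_results):
        """Higher-complexity step applied to the much smaller first result."""
        if not first_results:
            return None
        return sum(r["value"] for r in first_results) / len(first_results)
    
    def process_stream(stream, chunk_size=4):
        buffered = []                                  # buffering step of claim 140
        outputs = []
        for record in stream:
            buffered.append(record)
            if len(buffered) == chunk_size:
                reduced = first_processing(buffered)   # inside the memory processing ICs
                outputs.append(second_processing(reduced))
                buffered.clear()
        return outputs
    
    stream = [{"value": v} for v in (5, 150, 240, 30, 90, 310, 120, 70)]
    print(process_stream(stream))                      # [195.0, 215.0]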
162. A method for distributed processing, the method comprising:
receiving information units by one or more memory processing integrated circuits of a disaggregated system, the disaggregated system including one or more computing subsystems separate from one or more storage subsystems, wherein each of the one or more memory processing integrated circuits includes a controller, a plurality of processor subunits, and a plurality of memory units;
wherein the one or more computing subsystems include a plurality of processing integrated circuits;
wherein a size of logic cells of the one or more memory processing integrated circuits is at least twice a size of corresponding logic cells of the plurality of processing integrated circuits;
performing, by the one or more memory processing integrated circuits, a processing operation on the information units to provide a processing result; and
outputting the processing result from the one or more memory processing integrated circuits.
163. The method of claim 162, comprising outputting the processing result to the one or more computing subsystems of the disaggregated system.
164. The method of claim 162, comprising receiving the information units from the one or more storage subsystems of the disaggregated system.
165. The method of claim 162, comprising outputting the processing result to the one or more storage subsystems of the disaggregated system.
166. The method of claim 162, comprising receiving the information units from the one or more computing subsystems of the disaggregated system.
167. The method of claim 166, wherein information units sent from different groups of processing units of the plurality of processing integrated circuits include different portions of intermediate results of a process executed by the plurality of processing integrated circuits, wherein a group of processing units includes at least one processing integrated circuit.
168. The method of claim 167, comprising outputting, by the one or more memory processing integrated circuits, a result of the entire process.
169. The method of claim 168, comprising sending the result of the entire process to each of the plurality of processing integrated circuits.
170. The method of claim 168, wherein the different portions of the intermediate results are different portions of an updated neural network model, and wherein the result of the entire process is the updated neural network model.
171. The method of claim 168, comprising sending the updated neural network model to each of the plurality of processing integrated circuits.
172. The method of claim 162, comprising outputting the processing result using a switching subunit of the disaggregated system.
173. The method of claim 162, wherein the one or more memory processing integrated circuits are included in a memory processing subsystem of the disaggregated system.
174. The method of claim 162, wherein at least one of the one or more memory processing integrated circuits is included in the one or more computing subsystems of the disaggregated system.
175. The method of claim 162, wherein at least one of the one or more memory processing integrated circuits is included in one or more memory subsystems of the disaggregated system.
176. The method of claim 162, wherein at least one of the following is true: (a) the information units are received from at least one of the plurality of processing integrated circuits; and (b) the processing result is sent to one or more memory processing integrated circuits of the plurality of processing integrated circuits.
177. The method of claim 176, wherein a critical dimension of logic cells of the one or more memory processing integrated circuits exceeds a critical dimension of corresponding logic cells of the plurality of processing integrated circuits by at least a factor of two.
178. The method of claim 176, wherein a critical dimension of memory cells of the one or more memory processing integrated circuits exceeds a critical dimension of corresponding logic cells of the plurality of processing integrated circuits by at least a factor of two.
179. The method of claim 162, wherein the information units include preprocessed information units.
180. The method of claim 179, comprising preprocessing the information units by the plurality of processing integrated circuits to provide the preprocessed information units.
181. The method of claim 162, wherein the information units convey portions of a model of a neural network.
182. The method of claim 162, wherein the information units convey partial results of at least one database query.
183. The method of claim 162, wherein the information units convey partial results of at least one aggregate database query.
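Claims 167-171 describe memory processing integrated circuits that collect different portions of an intermediate result (for example, slices of an updated neural network model) from groups of processing integrated circuits, assemble the full result, and send it back to every processing integrated circuit. A rough sketch of that gather-and-broadcast step follows; the slice naming and data layout are assumptions.

    # Assumed layout: each group of processing ICs produces an update for a
    # distinct slice of the model; the memory processing IC stitches the slices
    # together and broadcasts the full model back (claims 167-171).
    def assemble_updated_model(partial_updates):
        """partial_updates maps a slice name to the parameters computed by one group."""
        updated_model = {}
        for slice_name, parameters in partial_updates.items():
            updated_model[slice_name] = parameters          # gather the distinct portions
        return updated_model
    
    def broadcast(model, processing_ics):
        for ic in processing_ics:
            ic["model"] = model                             # send the result to each IC
    
    processing_ics = [{"name": f"ic{i}", "model": None} for i in range(3)]
    partials = {
        "layer0.weights": [0.1, 0.2],
        "layer1.weights": [0.3],
        "layer2.weights": [0.4, 0.5],
    }
    full_model = assemble_updated_model(partials)
    broadcast(full_model, processing_ics)
    print(processing_ics[0]["model"]["layer1.weights"])     # [0.3]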
184. A method for database analytics acceleration, the method comprising:
receiving, by a memory processing integrated circuit, a database query that includes at least one relevance criterion indicative of database entries of a database that are relevant to the database query;
wherein the memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;
determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit; and
sending the group of relevant database entries to one or more processing entities for further processing, substantially without sending irrelevant database entries stored in the memory processing integrated circuit to the one or more processing entities;
wherein the irrelevant database entries differ from the relevant database entries.
185. The method of claim 184, wherein the one or more processing entities are included in the plurality of processor subunits of the memory processing integrated circuit.
186. The method of claim 185, comprising further processing, by the memory processing integrated circuit, the group of relevant database entries to complete a response to the database query.
187. The method of claim 186, comprising outputting the response to the database query from the memory processing integrated circuit.
188. The method of claim 187, wherein the outputting includes applying a flow control process.
189. The method of claim 188, wherein the applying of the flow control process is responsive to indicators output from the one or more processing entities regarding completion of processing of one or more database entries of the group.
190. The method of claim 185, comprising further processing, by the memory processing integrated circuit, the group of relevant database entries to provide an intermediate response to the database query.
191. The method of claim 190, comprising outputting the intermediate response to the database query from the memory processing integrated circuit.
192. The method of claim 191, wherein the outputting includes applying a flow control process.
193. The method of claim 192, wherein the applying of the flow control process is responsive to indicators output from the one or more processing entities regarding completion of partial processing of database entries of the group.
194. The method of claim 185, comprising generating, by the one or more processing entities, processing status indicators indicative of a progress of the further processing of the group of relevant database entries.
195. The method of claim 185, comprising further processing the group of relevant database entries using the memory processing integrated circuit.
196. The method of claim 195, wherein the processing is performed by the plurality of processor subunits.
197. The method of claim 195, wherein the processing includes calculating an intermediate result by one of the plurality of processor subunits, sending the intermediate result to another of the plurality of processor subunits, and performing additional calculations by the other processor subunit.
198. The method of claim 195, wherein the processing is performed by the controller.
199. The method of claim 195, wherein the processing is performed by the plurality of processor subunits and the controller.
200. The method of claim 184, wherein the one or more processing entities are located outside the memory processing integrated circuit.
201. The method of claim 200, comprising outputting the group of relevant database entries from the memory processing integrated circuit.
202. The method of claim 201, wherein the outputting includes applying a flow control process.
203. The method of claim 202, wherein the applying of the flow control process is responsive to indicators output from the one or more processing entities regarding a relevance of database entries associated with the one or more processing entities.
204. The method of claim 184, wherein the plurality of processor subunits include full arithmetic logic units.
205. The method of claim 184, wherein the plurality of processor subunits include partial arithmetic logic units.
206. The method of claim 184, wherein the plurality of processor subunits include memory controllers.
207. The method of claim 184, wherein the plurality of processor subunits include partial memory controllers.
208. The method of claim 184, comprising outputting at least one of: (i) the group of relevant database entries, (ii) a response to the database query, and (iii) an intermediate response to the database query.
209. The method of claim 212, wherein the outputting includes applying traffic shaping.
210. The method of claim 212, wherein the outputting includes attempting to match a bandwidth used during the outputting to a maximum allowable bandwidth of a link that couples the memory processing integrated circuit to a requester unit.
211. The method of claim 212, wherein the outputting includes maintaining fluctuations of an output traffic rate below a threshold.
212. The method of claim 184, wherein the one or more processing entities include multiple processing entities, wherein at least one of the multiple processing entities belongs to the memory processing integrated circuit and at least another of the multiple processing entities does not belong to the memory processing integrated circuit.
213. The method of claim 184, wherein the one or more processing entities belong to another memory processing integrated circuit.
214. A method for database analytics acceleration, the method comprising:
receiving, by multiple memory processing integrated circuits, a database query that includes at least one relevance criterion indicative of database entries of a database that are relevant to the database query, wherein each of the multiple memory processing integrated circuits includes a controller, a plurality of processor subunits, and a plurality of memory units;
determining, by each of the multiple memory processing integrated circuits and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit; and
sending, by each of the multiple memory processing integrated circuits, the group of relevant database entries stored in the memory processing integrated circuit to one or more processing entities for further processing, substantially without sending irrelevant database entries stored in the memory processing integrated circuit to the one or more processing entities, wherein the irrelevant database entries differ from the relevant database entries.
215. A method for database analytics acceleration, the method comprising:
receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicative of database entries of a database that are relevant to the database query, wherein the integrated circuit includes a controller, a filtering unit, and a plurality of memory units;
determining, by the filtering unit and based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit; and
sending the group of relevant database entries to one or more processing entities located outside the integrated circuit for further processing, substantially without sending irrelevant database entries stored in the integrated circuit to the one or more processing entities.
216. A method for database analytics acceleration, the method comprising:
receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicative of database entries of a database that are relevant to the database query;
wherein the integrated circuit includes a controller, a processing unit, and a plurality of memory units;
determining, by the processing unit and based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit;
processing, by the processing unit, the group of relevant database entries, without processing, by the processing unit, irrelevant data entries stored in the integrated circuit, to provide a processing result, wherein the irrelevant database entries differ from the relevant database entries; and
outputting the processing result from the integrated circuit.
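The claims above describe pushing a query's relevance criterion down to the circuit that holds the data, so that only relevant entries leave the chip. A small sketch of such predicate filtering follows; the query shape (a simple column comparison), the dict-based storage, and the names are assumptions.

    # Sketch of near-memory filtering in the spirit of claims 184, 214 and 215.
    def filter_relevant_entries(stored_entries, relevance_criterion):
        """Return only the entries that satisfy the query's relevance criterion."""
        column, op, value = relevance_criterion
        def matches(entry):
            if op == ">":
                return entry[column] > value
            if op == "==":
                return entry[column] == value
            raise ValueError(f"unsupported operator: {op}")
        return [entry for entry in stored_entries if matches(entry)]
    
    # Entries held inside one memory processing integrated circuit.
    stored = [
        {"id": 1, "country": "IL", "amount": 120},
        {"id": 2, "country": "US", "amount": 40},
        {"id": 3, "country": "IL", "amount": 15},
    ]
    
    relevant = filter_relevant_entries(stored, ("amount", ">", 50))
    send_to_processing_entities = relevant      # irrelevant entries stay on-chip
    print(send_to_processing_entities)          # [{'id': 1, 'country': 'IL', 'amount': 120}]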
217. A method for retrieving feature-vector-related information, the method comprising:
receiving, by a memory processing integrated circuit, retrieval information for a retrieval of multiple requested feature vectors that are mapped to multiple sentence segments, wherein the memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units, each of the memory units being coupled to a processor subunit;
retrieving the multiple requested feature vectors from at least some of the plurality of memory units, wherein the retrieving includes concurrently requesting, from two or more memory units, requested feature vectors stored in the two or more memory units; and
outputting, from the memory processing integrated circuit, an output that includes at least one of: (a) the requested feature vectors; and (b) a result of processing the requested feature vectors.
218. The method of claim 217, wherein the output includes the requested feature vectors.
219. The method of claim 217, wherein the output includes the result of the processing of the requested feature vectors.
220. The method of claim 219, wherein the processing is performed by the plurality of processor subunits.
221. The method of claim 220, wherein the processing includes sending a requested feature vector from one processor subunit to another processor subunit.
222. The method of claim 220, wherein the processing includes calculating an intermediate result by one processor subunit, sending the intermediate result to another processor subunit, and calculating, by the other processor subunit, another intermediate result or a processing result.
223. The method of claim 219, wherein the processing is performed by the controller.
224. The method of claim 219, wherein the processing is performed by the plurality of processor subunits and the controller.
225. The method of claim 219, wherein the processing is performed by a vector processor of the memory processing integrated circuit.
226. The method of claim 217, wherein the controller is configured to concurrently request the requested feature vectors based on a known mapping between sentence segments and locations of the feature vectors mapped to the sentence segments.
227. The method of claim 11, wherein the mapping is uploaded during a boot process of the memory processing integrated circuit.
228. The method of claim 217, wherein the controller is configured to manage the retrieval of the multiple requested feature vectors.
229. The method of claim 217, wherein the multiple sentence segments have a certain order, and wherein the outputting of the requested feature vectors is performed according to the certain order.
230. The method of claim 229, wherein the retrieving of the multiple requested feature vectors is performed according to the certain order.
231. The method of claim 229, wherein the retrieving of the multiple requested feature vectors is performed at least partially out of order, and wherein the retrieving further includes reordering the multiple requested feature vectors.
232. The method of claim 217, wherein the retrieving of the multiple requested feature vectors includes buffering the multiple requested feature vectors before the multiple requested feature vectors are read by the controller.
233. The method of claim 232, wherein the retrieving of the multiple requested feature vectors includes generating buffer status indicators indicative of when one or more buffers associated with the plurality of memory units store one or more requested feature vectors.
234. The method of claim 233, comprising conveying the buffer status indicators via dedicated control lines.
235. The method of claim 234, wherein one dedicated control line is allocated per memory unit.
236. The method of claim 234, wherein the buffer status indicators include one or more status bits stored in one or more of the buffers.
237. The method of claim 234, comprising conveying the buffer status indicators via one or more shared control lines.
238. The method of claim 217, wherein the retrieval information is included in one or more retrieval commands of a first resolution, the first resolution representing a certain number of bits.
239. The method of claim 238, comprising managing, by the controller, the retrieval at a higher resolution, the higher resolution representing a number of bits that is lower than the certain number of bits.
240. The method of claim 238, wherein the controller is configured to manage the retrieval according to a feature vector resolution.
241. The method of claim 238, comprising managing the retrieval independently by the controller.
242. The method of claim 217, wherein the plurality of processor subunits include full arithmetic logic units.
243. The method of claim 217, wherein the plurality of processor subunits include partial arithmetic logic units.
244. The method of claim 217, wherein the plurality of processor subunits include memory controllers.
245. The method of claim 217, wherein the plurality of processor subunits include partial memory controllers.
246. The method of claim 217, wherein the outputting of the output includes applying traffic shaping to the output.
247. The method of claim 217, wherein the outputting of the output includes matching a bandwidth used during the outputting to a maximum allowable bandwidth of a link that couples the memory processing integrated circuit to a requester unit.
248. The method of claim 217, wherein the outputting of the output includes maintaining fluctuations of an output traffic rate below a predetermined threshold.
249. The method of claim 217, wherein the retrieving includes applying predictive retrieval to at least some of the requested feature vectors from a set of requested feature vectors stored in a single memory unit.
250. The method of claim 217, wherein the requested feature vectors are distributed among the memory units.
251. The method of claim 217, wherein the requested feature vectors are distributed among the memory units based on an expected retrieval pattern.
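A compact sketch of the embedding-style lookup of claims 217-233 follows: feature vectors for the segments of a sentence are requested concurrently from the memory units that hold them, buffered, and then reordered to the sentence order before output. The segment-to-memory-unit mapping (a hash) and the buffering model are assumed stand-ins for the known mapping of claim 226.

    # Sketch of concurrent feature-vector retrieval (assumed mapping and buffering).
    from collections import defaultdict
    
    NUM_MEMORY_UNITS = 4
    
    def memory_unit_of(segment):
        return hash(segment) % NUM_MEMORY_UNITS          # stand-in for the known mapping
    
    def retrieve_feature_vectors(segments, memory_units):
        # Group the requests so each memory unit can be asked once, concurrently.
        requests = defaultdict(list)
        for position, segment in enumerate(segments):
            requests[memory_unit_of(segment)].append((position, segment))
    
        # Each memory unit fills its buffer (out of order across units).
        buffers = {}
        for unit_id, unit_requests in requests.items():
            buffers[unit_id] = [(pos, memory_units[unit_id][seg]) for pos, seg in unit_requests]
    
        # Reorder to the original sentence order before output (claims 229-231).
        ordered = [None] * len(segments)
        for unit_buffer in buffers.values():
            for position, vector in unit_buffer:
                ordered[position] = vector
        return ordered
    
    # Toy content of the memory units: each maps a segment to its feature vector.
    memory_units = [defaultdict(lambda: [0.0, 0.0]) for _ in range(NUM_MEMORY_UNITS)]
    for seg, vec in {"the": [0.1, 0.2], "cat": [0.3, 0.4], "sat": [0.5, 0.6]}.items():
        memory_units[memory_unit_of(seg)][seg] = vec
    
    print(retrieve_feature_vectors(["the", "cat", "sat"], memory_units))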
252. A method for memory-intensive processing, the method comprising:
performing processing operations by multiple processors included in a hybrid device, the hybrid device including a base die, first memory resources associated with at least one second die, and second memory resources associated with at least one third die, wherein the base die and the at least one second die are connected to each other by wafer-to-wafer bonding;
retrieving, using the multiple processors, information stored in the first memory resources; and
sending additional information from the second memory resources to the first memory resources;
wherein a total bandwidth of a first path between the base die and the at least one second die exceeds a total bandwidth of a second path between the at least one second die and the at least one third die, and wherein a storage capacity of the first memory resources is smaller than a storage capacity of the second memory resources.
253. The method of claim 252, wherein the second memory resources include high bandwidth memory (HBM) resources.
254. The method of claim 252, wherein the at least one third die includes a stack of high bandwidth memory (HBM) chips.
255. The method of claim 252, wherein at least some of the second memory resources belong to a third die of the at least one third die, the third die being connected to the base die without using wafer-to-wafer bonding.
256. The method of claim 252, wherein at least some of the second memory resources belong to a third die of the at least one third die, the third die being connected to a second die of the at least one second die without using wafer-to-wafer bonding.
257. The method of claim 252, wherein the first memory resources and the second memory resources include different levels of cache.
258. The method of claim 252, wherein the first memory resources are located between the base die and the second memory resources.
259. The method of claim 252, wherein the first memory resources are not located on top of the second memory resources.
260. The method of claim 252, comprising performing additional processing by a second die of the at least one second die, the second die including a plurality of processor subunits and the first memory resources.
261. The method of claim 260, wherein at least one processor subunit is coupled to a dedicated portion of the first memory resources allocated to that processor subunit.
262. The method of claim 261, wherein the dedicated portion of the first memory resources includes at least one memory bank.
263. The method of claim 252, wherein the multiple processors belong to a memory processing chip that also includes the first memory resources.
264. The method of claim 252, wherein the base die includes the multiple processors, wherein the multiple processors include a plurality of processor subunits coupled to the first memory resources via conductors formed using the wafer-to-wafer bonding.
265. The method of claim 264, wherein each processor subunit is coupled to a dedicated portion of the first memory resources allocated to that processor subunit.
266. A hybrid device for memory-intensive processing, the hybrid device comprising:
a base die;
multiple processors;
first memory resources of at least one second die; and
second memory resources of at least one third die;
wherein the base die and the at least one second die are connected to each other by wafer-to-wafer bonding;
wherein the multiple processors are configured to perform processing operations and to retrieve information stored in the first memory resources;
wherein the second memory resources are configured to send additional information from the second memory resources to the first memory resources;
wherein a total bandwidth of a first path between the base die and the at least one second die exceeds a total bandwidth of a second path between the at least one second die and the at least one third die; and
wherein a storage capacity of the first memory resources is smaller than a storage capacity of the second memory resources.
267. The hybrid device of claim 266, wherein the second memory resources include high bandwidth memory (HBM) resources.
268. The hybrid device of claim 266, wherein the at least one third die includes a stack of high bandwidth memory (HBM) memory chips.
269. The hybrid device of claim 266, wherein at least some of the second memory resources belong to a third die of the at least one third die, the third die being connected to the base die without using wafer-to-wafer bonding.
270. The hybrid device of claim 266, wherein at least some of the second memory resources belong to a third die of the at least one third die, the third die being connected to a second die of the at least one second die without using wafer-to-wafer bonding.
271. The hybrid device of claim 266, wherein the first memory resources and the second memory resources include different levels of cache.
272. The hybrid device of claim 266, wherein the first memory resources are located between the base die and the second memory resources.
273. The hybrid device of claim 266, wherein the first memory resources are located to one side of the second memory resources.
274. The hybrid device of claim 266, wherein a second die of the at least one second die is configured to perform additional processing, wherein the second die includes a plurality of processor subunits and the first memory resources.
275. The hybrid device of claim 274, wherein each processor subunit is coupled to a dedicated portion of the first memory resources allocated to that processor subunit.
276. The hybrid device of claim 275, wherein the dedicated portion of the first memory resources includes at least one memory bank.
277. The hybrid device of claim 266, wherein the multiple processors include a plurality of processor subunits of a memory processing chip that also includes the first memory resources.
278. The hybrid device of claim 266, wherein the base die includes the multiple processors, wherein the multiple processors include a plurality of processor subunits coupled to the first memory resources via conductors formed using the wafer-to-wafer bonding.
279. The hybrid device of claim 278, wherein each processor subunit is coupled to a dedicated portion of the first memory resources allocated to that processor subunit.
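Claims 252-279 describe a two-level arrangement: small, high-bandwidth first memory resources bonded wafer-to-wafer near the compute die, backed by larger but lower-bandwidth second memory resources (for example HBM stacks). A toy sketch of staging data through such a hierarchy follows; the capacities, the FIFO eviction policy, and the interfaces are assumptions.

    # Toy model of the two-level memory: processors read only from the small
    # first memory; misses are filled from the larger second memory.
    from collections import OrderedDict
    
    class HybridMemory:
        def __init__(self, first_capacity, second_memory):
            self.first = OrderedDict()            # small, wafer-bonded, high bandwidth
            self.first_capacity = first_capacity
            self.second = second_memory           # large, e.g. an HBM stack
    
        def read(self, address):
            if address not in self.first:         # miss: stage the data from second memory
                if len(self.first) >= self.first_capacity:
                    self.first.popitem(last=False)   # simple FIFO eviction
                self.first[address] = self.second[address]
            return self.first[address]
    
    second_memory = {addr: addr * 10 for addr in range(1_000)}   # backing store
    mem = HybridMemory(first_capacity=4, second_memory=second_memory)
    
    for addr in (0, 1, 2, 0, 5, 6, 0):
        mem.read(addr)
    print(list(mem.first.keys()))                 # contents of the first memory afterwards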
280. A method for database acceleration, the method comprising:
retrieving, by a network communication interface of a database acceleration integrated circuit, an amount of information from a storage unit;
performing first processing on the amount of information to provide first processed information;
sending, using a memory controller of the database acceleration integrated circuit and via an interface, the first processed information to multiple memory processing integrated circuits, wherein each memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;
performing, using the multiple memory processing integrated circuits, second processing on at least part of the first processed information to provide second processed information;
retrieving, by the memory controller of the database acceleration integrated circuit, information from the multiple memory processing integrated circuits, wherein the retrieved information includes at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information;
performing, using a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results; and
outputting the database acceleration results.
281. The method of claim 280, comprising managing at least one of the retrieving, first processing, sending, and processing using a management unit of the database acceleration integrated circuit.
282. The method of claim 281, wherein the managing is performed based on an execution plan generated by the management unit of the database acceleration integrated circuit.
283. The method of claim 281, wherein the managing is performed based on an execution plan received by the management unit of the database acceleration integrated circuit and not generated by the management unit.
284. The method of claim 281, wherein the managing includes allocating at least one of: (a) network communication interface resources; (b) decompression unit resources; (c) memory controller resources; (d) multiple memory processing integrated circuit resources; and (e) database acceleration unit resources.
285. The method of claim 280, wherein the network communication interface includes two or more different types of network communication ports.
286. The method of claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and a general network protocol storage interface port.
287. The method of claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and an Ethernet protocol storage interface port.
288. The method of claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and a PCIe port.
289. The method of claim 280, comprising a management unit that includes a compute node of a computing system and is controlled by a manager of the computing system.
290. The method of claim 280, comprising controlling, by a compute node of a computing system, at least one of the retrieving, first processing, sending, and third processing.
291. The method of claim 280, comprising simultaneously executing multiple tasks using the database acceleration integrated circuit.
292. The method of claim 280, comprising managing at least one of the retrieving, first processing, sending, and third processing using a management unit located outside the database acceleration integrated circuit.
293. The method of claim 280, wherein the database acceleration integrated circuit belongs to a computing system.
294. The method of claim 280, wherein the database acceleration integrated circuit does not belong to a computing system.
295. The method of claim 280, comprising performing, by a compute node of a computing system, at least one of the retrieving, first processing, sending, and third processing based on an execution plan sent to the database acceleration integrated circuit.
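A high-level sketch of the pipeline of claim 280 follows: pull data from storage, pre-process it, push it to the memory processing integrated circuits for a second pass, pull the results back, and run the final database operation in the acceleration unit. The stage implementations (cleanup, filtering, aggregation) are assumptions used only to make the flow concrete.

    # Flow-level sketch of claim 280. Each stage is a stand-in: the real stages
    # run on the network interface, the memory controller, the memory processing
    # ICs and the database acceleration unit of the chip.
    def retrieve_from_storage(storage):
        return list(storage)                           # network communication interface
    
    def first_processing(rows):
        return [r for r in rows if r is not None]      # e.g. decompression / cleanup
    
    def second_processing_on_memory_ics(rows, num_ics=4):
        # The rows are spread over the memory processing ICs; each IC filters its share.
        shards = [rows[i::num_ics] for i in range(num_ics)]
        return [r for shard in shards for r in shard if r["amount"] > 50]
    
    def database_processing(rows):
        return sum(r["amount"] for r in rows)          # e.g. an aggregation in the DB unit
    
    storage = [{"amount": 20}, None, {"amount": 80}, {"amount": 75}, {"amount": 10}]
    raw = retrieve_from_storage(storage)
    first = first_processing(raw)
    second = second_processing_on_memory_ics(first)
    print(database_processing(second))                 # 155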
296. The method of claim 280, wherein the performing of the database processing operations includes simultaneously executing database processing instructions by database processing subunits, wherein the database acceleration unit includes a group of database accelerator subunits that share a shared memory unit.
297. The method of claim 296, wherein each database processing subunit is configured to execute a certain type of database processing instruction.
298. The method of claim 297, comprising dynamically linking database processing subunits to provide an execution pipeline for executing a database processing operation that includes multiple instructions.
299. The method of claim 280, wherein the performing of the database processing operations includes allocating resources of the database acceleration integrated circuit according to temporal I/O bandwidth.
300. The method of claim 280, comprising outputting the database acceleration results to a local storage and retrieving the database acceleration results from the local storage.
301. The method of claim 280, wherein the network communication interface includes an RDMA unit.
302. The method of claim 280, comprising exchanging information between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.
303. The method of claim 280, comprising exchanging database acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.
304. The method of claim 280, comprising exchanging, between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits, at least one of: (a) information; and (b) database acceleration results.
305. The method of claim 304, wherein the database acceleration integrated circuits of a group are connected to a common printed circuit board.
306. The method of claim 304, wherein the database acceleration integrated circuits of a group belong to a modular unit of a computerized system.
307. The method of claim 304, wherein database acceleration integrated circuits of different groups are connected to different printed circuit boards.
308. The method of claim 304, wherein database acceleration integrated circuits of different groups belong to different modular units of a computerized system.
309. The method of claim 304, comprising performing distributed processing by the database acceleration integrated circuits of the one or more groups.
The method of claim 304, comprising accelerating integrated circuit execution of distributed processing by the database of the one or more groups. 310.如权利要求304所述的方法,其中使用一个或多个群组的所述数据库加速集成电路的网络通信接口执行所述交换。310. The method of claim 304, wherein the exchanging is performed using a network communication interface of the database acceleration integrated circuit of one or more groups. 311.如权利要求304所述的方法,其中在藉由星形连接而彼此连接的多个群组上执行所述交换。311. The method of claim 304, wherein the exchange is performed on a plurality of groups connected to each other by star connections. 312.如权利要求304所述的方法,其包含使用至少一个交换器以用于在所述一个或多个群组的不同群组的数据库加速集成电路之间交换以下各者中的至少一者:(a)信息及(b)数据库加速结果。312. The method of claim 304, comprising using at least one switch for switching at least one of the following between database acceleration integrated circuits of different groups of the one or more groups : (a) information and (b) database accelerated results. 313.如权利要求304所述的方法,其包含藉由所述一个或多个群组的所述数据库加速集成电路中的至少一些执行分布式处理。313. The method of claim 304, comprising accelerating at least some of the integrated circuits to perform distributed processing by the database of the one or more groups. 314.如权利要求304所述的方法,其包含执行第一及第二数据结构的分布式处理,其中所述第一及第二数据结构的总大小超过所述多个存储器处理集成电路的储存能力。314. The method of claim 304, comprising performing distributed processing of first and second data structures, wherein the combined size of the first and second data structures exceeds storage of the plurality of memory processing integrated circuits ability. 315.如权利要求314所述的方法,其中所述分布式处理的所述执行包含执行以下各者的多个迭代:(a)执行第一数据结构部分及第二数据结构部分的不同对至不同数据库加速集成电路的新分配;及(b)处理所述不同对。315. The method of claim 314, wherein the performing of the distributed processing comprises performing multiple iterations of: (a) performing different pairs of a first data structure portion and a second data structure portion to Different databases speed up new assignments of integrated circuits; and (b) process the different pairs. 316.如权利要求315所述的方法,其中所述分布式处理的所述执行包括数据库接合操作。316. The method of claim 315, wherein the performing of the distributed processing comprises a database join operation. 317.如权利要求315所述的方法,其中所述分布式处理的所述执行包含:317. The method of claim 315, wherein the performing of the distributed processing comprises: 将不同的第一数据结构部分分配给所述一个或多个群组的不同数据库加速集成电路;及assigning different first data structure portions to different database acceleration integrated circuits of the one or more groups; and 执行以下各者的多个迭代:Perform multiple iterations of: 将不同的第二数据结构部分新分配给所述一个或多个群组的不同数据库加速集成电路,及newly assigning a different second data structure portion to a different database acceleration integrated circuit of the one or more groups, and 藉由所述数据库加速集成电路处理所述第一及第二数据结构部分。Processing of the first and second data structure portions by the integrated circuit is accelerated by the database. 318.如权利要求317所述的方法,其中以与当前迭代的所述处理至少部分时间重叠的方式执行下一迭代的所述新分配。318. The method of claim 317, wherein the new assignment of the next iteration is performed in a manner that overlaps at least partially in time with the processing of the current iteration. 319.如权利要求317所述的方法,其中所述新分配包含在所述不同数据库加速集成电路之间交换第二数据结构部分。319. The method of claim 317, wherein the new allocation comprises exchanging a second data structure portion between the different database acceleration integrated circuits. 320.如权利要求319所述的方法,其中以与所述处理至少一部分时间重叠的方式执行所述交换。320. The method of claim 319, wherein the exchanging is performed in a manner that overlaps at least a portion of the time with the processing. 321.如权利要求317所述的方法,其中所述新分配包含在群组的所述不同数据库加速集成电路之间交换第二数据结构部分;及一旦所述交换已完成,便在数据库加速集成电路的不同群组之间交换第二数据结构部分。321. 
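Claims 314-321 describe distributed processing of two data structures whose combined size exceeds the storage capacity of the memory processing integrated circuits: first-structure portions stay resident on the accelerators while second-structure portions are re-allocated (rotated) on each iteration, optionally overlapping the exchange with processing. The Python sketch below is a software-only illustration of that rotation scheme under simplified assumptions, not the patented hardware; the names (Accelerator, rotate_join) and the toy join predicate are invented for the example.

```python
class Accelerator:
    """Software stand-in for one database acceleration integrated circuit."""
    def __init__(self, name, first_portion):
        self.name = name
        self.first_portion = first_portion   # stays resident (cf. claim 317)
        self.second_portion = None           # re-allocated every iteration

    def process_pair(self):
        # Toy "join": emit rows of the second portion whose key also appears
        # in the resident first portion (cf. the join of claim 316).
        keys = {row["k"] for row in self.first_portion}
        return [row for row in self.second_portion if row["k"] in keys]

def rotate_join(first_parts, second_parts):
    accs = [Accelerator(f"dbaic{i}", part) for i, part in enumerate(first_parts)]
    results = []
    for step in range(len(second_parts)):
        # New allocation of second-structure portions (cf. claims 317/319):
        # rotate so that, after len(second_parts) iterations, every
        # (first portion, second portion) pair has been co-located once.
        # A real system could overlap this exchange with processing
        # (cf. claims 318/320).
        for i, acc in enumerate(accs):
            acc.second_portion = second_parts[(i + step) % len(second_parts)]
        for acc in accs:
            results.extend(acc.process_pair())
    return results

if __name__ == "__main__":
    first = [[{"k": 1}, {"k": 2}], [{"k": 3}]]             # portions of structure A
    second = [[{"k": 2, "v": "x"}], [{"k": 3, "v": "y"}]]  # portions of structure B
    print(rotate_join(first, second))   # [{'k': 2, 'v': 'x'}, {'k': 3, 'v': 'y'}]
```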
322. The method of claim 280, wherein the database acceleration integrated circuit is included in a blade that comprises multiple database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet switch, a PCIe switch and an Ethernet switch, and the plurality of memory processing integrated circuits.
323. An apparatus for database acceleration, the apparatus comprising:
a database acceleration integrated circuit; and
a plurality of memory processing integrated circuits, wherein each memory processing integrated circuit comprises a controller, a plurality of processor subunits, and a plurality of memory units;
wherein a network communication interface of the database acceleration integrated circuit is configured to receive information from a storage unit;
wherein the database acceleration integrated circuit is configured to perform first processing on an amount of the information to provide first processed information;
wherein a memory controller of the database acceleration integrated circuit is configured to send the first processed information to the plurality of memory processing integrated circuits via an interface;
wherein the plurality of memory processing integrated circuits are configured to perform second processing on at least part of the first processed information to provide second processed information;
wherein the memory controller of the database acceleration integrated circuit is configured to retrieve information from the plurality of memory processing integrated circuits, wherein the retrieved information comprises at least one of: (a) at least a portion of the first processed information; and (b) at least a portion of the second processed information;
wherein a database acceleration unit of the database acceleration integrated circuit is configured to perform database processing operations on the retrieved information to provide database acceleration results; and
wherein the database acceleration integrated circuit is configured to output the database acceleration results.
324. The apparatus of claim 323, configured to manage, using a management unit of the database acceleration integrated circuit, at least one of the retrieving of the retrieved information, the first processing, and the second processing.
325. The apparatus of claim 324, wherein the management unit is configured to perform the managing based on an execution plan generated by the management unit of the database acceleration integrated circuit.
326. The apparatus of claim 324, wherein the management unit is configured to perform the managing based on an execution plan that is received by the management unit of the database acceleration integrated circuit and is not generated by the management unit.
327. The apparatus of claim 324, wherein the management unit is configured to perform the managing by allocating one or more of: (a) network communication interface resources; (b) decompression unit resources; (c) memory controller resources; (d) resources of the plurality of memory processing integrated circuits; and (e) database acceleration unit resources.
328. The apparatus of claim 323, wherein the network communication interface comprises different types of network communication ports.
329. The apparatus of claim 328, wherein the different types of network communication ports comprise a storage interface protocol port and a general network protocol storage interface port.
330. The apparatus of claim 328, wherein the different types of network communication ports comprise a storage interface protocol port and an Ethernet protocol storage interface port.
331. The apparatus of claim 328, wherein the different types of network communication ports comprise a storage interface protocol port and a PCIe port.
332. The apparatus of claim 323, wherein the apparatus is coupled to a management unit that comprises a computing node of a computing system and is controlled by a manager of the computing system.
333. The apparatus of claim 323, configured to be controlled by a computing node of a computing system.
334. The apparatus of claim 323, configured to concurrently execute multiple tasks by the database acceleration integrated circuit.
335. The apparatus of claim 323, wherein the database acceleration integrated circuit belongs to a computing system.
336. The apparatus of claim 323, wherein the database acceleration integrated circuit does not belong to a computing system.
337. The apparatus of claim 323, configured to perform, by a computing node of a computing system, at least one of the retrieving, first processing, sending, and third processing based on an execution plan sent to the database acceleration integrated circuit.
338. The apparatus of claim 323, wherein the database acceleration unit is configured to concurrently execute database processing instructions by database processing subunits, wherein the database acceleration unit comprises a group of database accelerator subunits that share a shared memory unit.
339. The apparatus of claim 338, wherein each database processing subunit is configured to execute a particular type of database processing instruction.
340. The apparatus of claim 339, wherein the apparatus is configured to dynamically link database processing subunits to provide an execution pipeline for executing a database processing operation that comprises multiple instructions.
341. The apparatus of claim 323, wherein the apparatus is configured to allocate resources of the database acceleration integrated circuit according to I/O bandwidth over time.
342. The apparatus of claim 323, wherein the apparatus comprises local storage accessible by the database acceleration integrated circuit.
343. The apparatus of claim 323, wherein the network communication interface comprises an RDMA unit.
344. The apparatus of claim 323, wherein the apparatus comprises one or more groups of database acceleration integrated circuits, and the database acceleration integrated circuits are configured to exchange information between the database acceleration integrated circuits of the one or more groups of database acceleration integrated circuits.
345. The apparatus of claim 323, wherein the apparatus comprises one or more groups of database acceleration integrated circuits, and the database acceleration integrated circuits are configured to exchange acceleration results between the database acceleration integrated circuits of the one or more groups of database acceleration integrated circuits.
346. The apparatus of claim 323, wherein the apparatus comprises one or more groups of database acceleration integrated circuits, and the database acceleration integrated circuits are configured to exchange, between the database acceleration integrated circuits of the one or more groups, at least one of: (a) information; and (b) database acceleration results.
347. The apparatus of claim 346, wherein the database acceleration integrated circuits of a group are connected to a same printed circuit board.
348. The apparatus of claim 346, wherein the database acceleration integrated circuits of a group belong to a modular unit of a computerized system.
349. The apparatus of claim 346, wherein database acceleration integrated circuits of different groups are connected to different printed circuit boards.
350. The apparatus of claim 346, wherein database acceleration integrated circuits of different groups belong to different modular units of the computerized system.
351. The apparatus of claim 346, wherein the exchanging is performed using network communication interfaces of the database acceleration integrated circuits of the one or more groups.
352. The apparatus of claim 346, wherein the exchanging is performed over multiple groups that are connected to each other by a star connection.
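Claims 296-298 and 338-340 above describe database accelerator subunits that each execute one particular type of database processing instruction and are dynamically linked into an execution pipeline for a multi-instruction operation. The following Python sketch models that idea in software only, under obvious simplifications; the subunit names, the SUBUNITS table, and build_pipeline are all hypothetical and are not taken from the patent.

```python
# Each "subunit" handles exactly one instruction type (cf. claims 297/339);
# a pipeline is linked dynamically from the instruction sequence of a
# database processing operation (cf. claims 298/340).
SUBUNITS = {
    "filter":    lambda rows, pred: [r for r in rows if pred(r)],
    "project":   lambda rows, cols: [{c: r[c] for c in cols} for r in rows],
    "aggregate": lambda rows, key: {"count": len(rows), "key": key},
}

def build_pipeline(instructions):
    """Dynamically link subunits into an execution pipeline."""
    def run(rows):
        data = rows
        for op, arg in instructions:   # one subunit per instruction type
            data = SUBUNITS[op](data, arg)
        return data
    return run

if __name__ == "__main__":
    rows = [{"id": 1, "price": 5}, {"id": 2, "price": 12}, {"id": 3, "price": 30}]
    query = [("filter", lambda r: r["price"] > 10),
             ("project", ["id"]),
             ("aggregate", "id")]
    pipeline = build_pipeline(query)
    print(pipeline(rows))   # {'count': 2, 'key': 'id'}
```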
353. The apparatus of claim 346, wherein the apparatus is configured to use at least one switch for exchanging, between database acceleration integrated circuits of different groups of the one or more groups, at least one of: (a) information; and (b) database acceleration results.
354. The apparatus of claim 346, wherein the apparatus is configured to perform distributed processing by some of the database acceleration integrated circuits of some of the one or more groups.
355. The apparatus of claim 346, wherein the apparatus is configured to perform distributed processing using first and second data structures, wherein a total size of the first and second data structures exceeds a storage capacity of the plurality of memory processing integrated circuits.
356. The apparatus of claim 355, wherein the apparatus is configured to perform the distributed processing by performing multiple iterations of: (a) performing a new allocation of different pairs of a first data structure portion and a second data structure portion to different database acceleration integrated circuits; and (b) processing the different pairs.
357. The apparatus of claim 355, wherein the distributed processing comprises a database join operation.
358. The apparatus of claim 355, wherein the apparatus is configured to perform the distributed processing by:
allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups; and
performing multiple iterations of:
newly allocating different second data structure portions to different database acceleration integrated circuits of the one or more groups, and
processing the first and second data structure portions by the database acceleration integrated circuits.
359. The apparatus of claim 358, wherein the apparatus is configured to perform the new allocation of a next iteration in a manner that at least partially overlaps in time with the processing of a current iteration.
360. The apparatus of claim 358, wherein the apparatus is configured to perform the new allocation by exchanging second data structure portions between the different database acceleration integrated circuits.
361. The apparatus of claim 360, wherein the exchanging is performed by the database acceleration integrated circuits in a manner that at least partially overlaps in time with the processing.
362. The apparatus of claim 358, wherein the apparatus is configured to perform the new allocation by exchanging second data structure portions between the different database acceleration integrated circuits of a group and, once that exchange has been completed, exchanging second data structure portions between different groups of database acceleration integrated circuits.
363. The apparatus of claim 323, wherein the database acceleration integrated circuit is included in a blade that comprises multiple database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet switch, a PCIe switch and an Ethernet switch, and the plurality of memory processing integrated circuits.
364. A method for database acceleration, the method comprising:
retrieving information from a storage unit by a network communication interface of a database acceleration integrated circuit;
performing first processing on an amount of the information to provide first processed information;
sending, by a memory controller of the database acceleration integrated circuit and via an interface, the first processed information to a plurality of memory resources;
retrieving information from the plurality of memory resources;
performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results; and
outputting the database acceleration results.
365. The method of claim 364, further comprising processing the first processed information to provide second processed information, wherein the processing of the first processed information is performed by a plurality of processors located in one or more memory processing integrated circuits that further comprise the plurality of memory resources.
366. The method of claim 364, wherein the first processing comprises filtering database entries.
367. The method of claim 364, wherein the second processing comprises filtering database entries.
368. The method of claim 364, wherein the first processing and the second processing comprise filtering database entries.
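Claims 364-368 recite a data path: retrieval from a storage unit over the network communication interface, first processing (for example filtering database entries), staging into memory resources, retrieval from those resources, a database processing operation, and output of the result. The sketch below is a rough, software-only illustration of that sequence; the function names and the in-memory dictionaries standing in for the storage unit and the memory resources are invented for the example.

```python
STORAGE_UNIT = [  # stands in for the external storage unit of claim 364
    {"user": "a", "amount": 120},
    {"user": "b", "amount": 15},
    {"user": "a", "amount": 60},
]

def fetch_from_storage():
    # retrieval over the network communication interface
    return list(STORAGE_UNIT)

def first_processing(rows):
    # e.g., filtering database entries (cf. claim 366)
    return [r for r in rows if r["amount"] >= 50]

def database_acceleration(rows):
    # database processing operation producing the acceleration result
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals

def run_query():
    memory_resources = {}                    # stands in for the memory resources
    staged = first_processing(fetch_from_storage())
    memory_resources["staged"] = staged      # send via the memory controller
    retrieved = memory_resources["staged"]   # retrieve for acceleration
    return database_acceleration(retrieved)  # output the result

if __name__ == "__main__":
    print(run_query())   # {'a': 180}
```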
CN202080071415.1A 2019-08-13 2020-08-13 memory-based processor Pending CN114586019A (en)

Applications Claiming Priority (11)

Application Number Priority Date Filing Date Title
US201962886328P 2019-08-13 2019-08-13
US62/886,328 2019-08-13
US201962907659P 2019-09-29 2019-09-29
US62/907,659 2019-09-29
US201962930593P 2019-11-05 2019-11-05
US62/930,593 2019-11-05
US202062971912P 2020-02-07 2020-02-07
US62/971,912 2020-02-07
US202062983174P 2020-02-28 2020-02-28
US62/983,174 2020-02-28
PCT/IB2020/000665 WO2021028723A2 (en) 2019-08-13 2020-08-13 Memory-based processors

Publications (1)

Publication Number Publication Date
CN114586019A (en)

Family

ID=74570549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080071415.1A Pending CN114586019A (en) 2019-08-13 2020-08-13 memory-based processor

Country Status (5)

Country Link
EP (1) EP4010808A4 (en)
KR (1) KR20220078566A (en)
CN (1) CN114586019A (en)
TW (1) TW202122993A (en)
WO (1) WO2021028723A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110875066A (en) * 2018-09-03 2020-03-10 爱思开海力士有限公司 Semiconductor device and semiconductor system including semiconductor device
CN115237036A (en) * 2022-09-22 2022-10-25 之江实验室 Full-digitalization management device for wafer-level processor system
CN118295960A (en) * 2024-06-03 2024-07-05 芯方舟(上海)集成电路有限公司 Force calculating chip, design method and manufacturing method thereof and force calculating chip system
WO2024193274A1 (en) * 2023-03-22 2024-09-26 华为技术有限公司 Memory and device
TWI880786B (en) * 2024-06-20 2025-04-11 慧榮科技股份有限公司 Method and computer program product and apparatus for accelerating execution of host read commands

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11868777B2 (en) 2020-12-16 2024-01-09 Advanced Micro Devices, Inc. Processor-guided execution of offloaded instructions using fixed function operations
US12073251B2 (en) 2020-12-29 2024-08-27 Advanced Micro Devices, Inc. Offloading computations from a processor to remote execution logic
US12367386B2 (en) * 2021-05-18 2025-07-22 Silicon Storage Technology, Inc. Split array architecture for analog neural memory in a deep learning artificial neural network
KR20250114451A (en) * 2021-05-18 2025-07-29 실리콘 스토리지 테크놀로지 인크 Split array architecture for analog neural memory in a deep learning artificial neural network
US11996137B2 (en) * 2021-05-21 2024-05-28 Taiwan Semiconductor Manufacturing Company, Ltd. Compute in memory (CIM) memory array
US11327771B1 (en) * 2021-07-16 2022-05-10 SambaNova Systems, Inc. Defect repair circuits for a reconfigurable data processor
US12112792B2 (en) * 2021-08-10 2024-10-08 Micron Technology, Inc. Memory device for wafer-on-wafer formed memory and logic
CN115729845A (en) * 2021-08-30 2023-03-03 华为技术有限公司 Data storage device and data processing method
US11914532B2 (en) 2021-08-31 2024-02-27 Apple Inc. Memory device bandwidth optimization
CN115965060A (en) * 2021-10-11 2023-04-14 Oppo广东移动通信有限公司 Data processing method, device, chip and computer equipment for AI chip
US11947940B2 (en) 2021-10-11 2024-04-02 International Business Machines Corporation Training data augmentation via program simplification
CN116264085B (en) * 2021-12-14 2025-08-15 长鑫存储技术有限公司 Storage system and data writing method thereof
EP4460768A1 (en) 2022-01-05 2024-11-13 Neuroblade, Ltd. Processing systems
CN114490433A (en) * 2022-01-20 2022-05-13 哲库科技(上海)有限公司 Management method of storage space, data processing chip, device and storage medium
TWI819480B (en) 2022-01-27 2023-10-21 緯創資通股份有限公司 Acceleration system and dynamic configuration method thereof
TWI776785B (en) * 2022-04-07 2022-09-01 點序科技股份有限公司 Die test system and die test method thereof
US12224034B2 (en) 2022-05-11 2025-02-11 Macronix International Co., Ltd. Memory device and data approximation search method thereof
TWI882324B (en) * 2022-05-12 2025-05-01 美商賽發馥股份有限公司 Integrated circuit and method for load-store pipeline selection for vectors
US11755399B1 (en) * 2022-05-24 2023-09-12 Macronix International Co., Ltd. Bit error rate reduction technology
WO2023227945A1 (en) * 2022-05-25 2023-11-30 Neuroblade Ltd. Processing systems and methods
US12197378B2 (en) * 2022-06-01 2025-01-14 Advanced Micro Devices, Inc. Method and apparatus to expedite system services using processing-in-memory (PIM)
EP4565964A1 (en) * 2022-08-05 2025-06-11 Synthara AG Memory-mapped compact computing array
US12050532B2 (en) * 2022-09-23 2024-07-30 Apple Inc. Routing circuit for computer resource topology
TWI843280B (en) * 2022-11-09 2024-05-21 財團法人工業技術研究院 Artificial intelligence accelerator and operating method thereof
CN115599025B (en) * 2022-12-12 2023-03-03 南京芯驰半导体科技有限公司 Resource grouping control system, method and storage medium of chip array
TWI860605B (en) * 2023-01-04 2024-11-01 熵碼科技股份有限公司 Anti-tampering detector and method for detecting physical attack
KR102741087B1 (en) * 2023-04-19 2024-12-11 한국과학기술원 Subarray-level processing-in-memory architecture with look-up table-based linear interpolation
US20250053720A1 (en) * 2023-08-09 2025-02-13 Samsung Electronics Co., Ltd. Systems and methods for data-analytics acceleration
WO2025038369A1 (en) * 2023-08-11 2025-02-20 Versum Materials Us, Llc Fefet structures using amorphous oxide semiconductor channels on integrated circuits
CN116962176B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 A distributed cluster data processing method, device, system and storage medium
CN118133574B (en) * 2024-05-06 2024-07-19 沐曦集成电路(上海)有限公司 SRAM (static random Access memory) generating system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120005421A1 (en) * 2000-08-21 2012-01-05 Renesas Electronics Corporation Memory controller and data processing system
CN108733596A (en) * 2017-04-21 2018-11-02 英特尔公司 Static schedulable feeding for systolic arrays framework and discharge structure

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9612979B2 (en) * 2010-10-22 2017-04-04 Intel Corporation Scalable memory protection mechanism
US20140040622A1 (en) * 2011-03-21 2014-02-06 Mocana Corporation Secure unlocking and recovery of a locked wrapped app on a mobile device
US9262246B2 (en) * 2011-03-31 2016-02-16 Mcafee, Inc. System and method for securing memory and storage of an electronic device with a below-operating system security agent
US8590050B2 (en) * 2011-05-11 2013-11-19 International Business Machines Corporation Security compliant data storage management
US8996951B2 (en) * 2012-11-15 2015-03-31 Elwha, Llc Error correction with non-volatile memory on an integrated circuit
EP2923279B1 (en) * 2012-11-21 2016-11-02 Coherent Logix Incorporated Processing system with interspersed processors; dma-fifo
KR102782323B1 (en) * 2017-07-30 2025-03-18 뉴로블레이드, 리미티드. Memory-based distributed processor architecture
US10810141B2 (en) * 2017-09-29 2020-10-20 Intel Corporation Memory control management of a processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120005421A1 (en) * 2000-08-21 2012-01-05 Renesas Electronics Corporation Memory controller and data processing system
CN108733596A (en) * 2017-04-21 2018-11-02 英特尔公司 Static schedulable feeding for systolic arrays framework and discharge structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VLADIMIR V. STANKOVIC et al.: "DRAM Controller with a Complete Predictor: Preliminary Results", IEEE, 10 January 2006 (2006-01-10), pages 593-596 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110875066A (en) * 2018-09-03 2020-03-10 爱思开海力士有限公司 Semiconductor device and semiconductor system including semiconductor device
CN115237036A (en) * 2022-09-22 2022-10-25 之江实验室 Full-digitalization management device for wafer-level processor system
CN115237036B (en) * 2022-09-22 2023-01-10 之江实验室 An all-digital management device for wafer-level processor systems
WO2024193274A1 (en) * 2023-03-22 2024-09-26 华为技术有限公司 Memory and device
CN119201792A (en) * 2023-03-22 2024-12-27 华为技术有限公司 A memory and a device
CN118295960A (en) * 2024-06-03 2024-07-05 芯方舟(上海)集成电路有限公司 Force calculating chip, design method and manufacturing method thereof and force calculating chip system
CN118295960B (en) * 2024-06-03 2024-09-03 芯方舟(上海)集成电路有限公司 Force calculating chip, design method and manufacturing method thereof and force calculating chip system
TWI880786B (en) * 2024-06-20 2025-04-11 慧榮科技股份有限公司 Method and computer program product and apparatus for accelerating execution of host read commands

Also Published As

Publication number Publication date
TW202122993A (en) 2021-06-16
EP4010808A2 (en) 2022-06-15
EP4010808A4 (en) 2023-11-15
KR20220078566A (en) 2022-06-10
WO2021028723A3 (en) 2021-07-08
WO2021028723A2 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
CN114586019A (en) memory-based processor
US20220164294A1 (en) Cyber security and tamper detection techniques with a distributed processor memory chip
TWI779069B (en) Memory chip with a memory-based distributed processor architecture
TWI856974B (en) Variable word length access
US11837305B2 (en) Memory-based logic testing
CN111433758B (en) Programmable computing and control chip, design method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination