HK1207179B - Engine architecture for processing finite automata - Google Patents
Background Art
The Open Systems Interconnection (OSI) reference model defines seven network protocol layers (L1-L7) used to communicate over a transmission medium. The upper layers (L4-L7) represent end-to-end communications and the lower layers (L1-L3) represent local communications.
Networked application-aware systems need to process, filter, and switch a range of L3 to L7 network protocol layers, for example, L7 network protocol layers such as Hypertext Transfer Protocol (HTTP) and Simple Mail Transfer Protocol (SMTP), and L4 network protocol layers such as Transmission Control Protocol (TCP). In addition to processing the network protocol layers, networked application-aware systems need to secure these protocols at line speed (i.e., the rate of data transfer over the physical medium of the network over which data is transmitted and received) with access- and content-based security through the L4-L7 network protocol layers, including firewalls, virtual private networks (VPN), secure sockets layer (SSL), intrusion detection systems (IDS), Internet Protocol security (IPSec), anti-virus (AV), and anti-spam functionality.
Network processors are available for high-throughput L2 and L3 network protocol processing, that is, performing packet processing to forward packets at line speed. Typically, a general-purpose processor is used to process L4-L7 network protocols, which require more intelligent processing. Although a general-purpose processor can perform such compute-intensive tasks, it lacks sufficient performance to process the data so that it can be forwarded at line speed.
Intrusion detection system (IDS) applications can inspect the contents of individual packets flowing through a network and can identify suspicious patterns that may indicate an attempt to break into or compromise a system. One example of a suspicious pattern may be a particular text string in a packet followed, 100 characters later, by another particular text string. Such content-aware networking may require inspection of the contents of packets at "wire speed." The content may be analyzed to determine whether there has been a security breach or an intrusion.
A large number of patterns and rules in the form of regular expressions (also referred to herein as regular expression patterns) may be applied to ensure that all security breaches or intrusions are detected. A regular expression is a compact method for describing a pattern in a string of characters. The simplest pattern matched by a regular expression is a single character or string of characters, for example, /c/ or /cat/. A regular expression may also include operators and meta-characters that have special meanings. Through the use of meta-characters, a regular expression may be used for more complicated searches such as "abc.*xyz", that is, find the string "abc" followed by the string "xyz", with an unlimited number of characters in between. Another example is the regular expression "abc..abc.*xyz", that is, find the string "abc", followed by any two characters, then the string "abc", and then, after an unlimited number of characters, the string "xyz".
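The behavior of the two meta-character examples above can be checked with any conventional regular expression engine; the sketch below uses Python's standard re module (not the hardware engine described herein) purely to illustrate the described matching semantics.

```python
import re

# "abc.*xyz": "abc" followed by "xyz", any number of characters between
assert re.search(r"abc.*xyz", "abc-------xyz") is not None
assert re.search(r"abc.*xyz", "abcxyz") is not None   # zero characters between
assert re.search(r"abc.*xyz", "abc only") is None

# "abc..abc.*xyz": "abc", exactly two characters, "abc" again,
# then "xyz" after any number of characters
assert re.search(r"abc..abc.*xyz", "abc12abc---xyz") is not None
assert re.search(r"abc..abc.*xyz", "abc1abc---xyz") is None  # only one gap character
```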
Content searching is typically performed using a search method such as a Deterministic Finite Automaton (DFA) or a Non-Deterministic Finite Automaton (NFA) to process the regular expression.
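As an illustration of the NFA approach (a hand-built sketch, not the encoding used by the embodiments herein), an NFA can be simulated by advancing a set of active states one input character at a time; a DFA, by contrast, pre-computes each reachable state set as a single state, so only one state is active per input character.

```python
# Minimal anchored NFA simulation over a hand-built transition table.
# Each (state, character) pair maps to a set of successor states; '.'
# stands for "any character". The set of active states is advanced one
# input character at a time, which is the essence of NFA processing.
def nfa_match(transitions, start, accept, payload):
    active = {start}
    for ch in payload:
        nxt = set()
        for state in active:
            nxt |= transitions.get((state, ch), set())
            nxt |= transitions.get((state, '.'), set())
        active = nxt
        if active & accept:
            return True   # an accepting state was reached
    return False

# Hypothetical NFA for the pattern "ab.*c", anchored at the payload start
t = {
    (0, 'a'): {1},
    (1, 'b'): {2},
    (2, '.'): {2},   # self-loop consumes the unbounded ".*" gap
    (2, 'c'): {3},
}
```

For the payload "abxxc" the active sets evolve as {0}, {1}, {2}, {2}, {2}, then {2, 3}, and state 3 is accepting.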
Summary of the Invention
Embodiments disclosed herein provide a method, apparatus, and corresponding system of an engine architecture for processing finite automata.
According to one embodiment, a security device may be operatively coupled to a network. The security device may comprise at least one central processing unit (CPU) core and at least one hyper non-deterministic automaton (HNA) processor operatively coupled to the at least one CPU core. The at least one HNA processor may be dedicated to non-deterministic finite automata (NFA) processing. The at least one HNA processor may include a plurality of superclusters, each supercluster including a plurality of clusters, and each cluster of the plurality of clusters including a plurality of HNA processing units (HPUs). The at least one CPU core may be configured to select at least one supercluster of the plurality of superclusters. The at least one HNA processor may include an HNA on-chip instruction queue configured to store at least one HNA instruction. The at least one HNA processor may include an HNA scheduler. The HNA scheduler may be configured to select a given HPU of the plurality of HPUs of the plurality of clusters of the at least one supercluster selected and to assign the at least one HNA instruction to the given HPU in order to initiate matching of at least one regular expression pattern in an input stream received from the network.
Each supercluster may further include a supercluster graph memory exclusive to the corresponding supercluster. The supercluster graph memory may be accessible to the corresponding plurality of HPUs of the corresponding plurality of clusters of the corresponding supercluster. The supercluster graph memory may be configured to statically store a subset of nodes of at least one per-pattern NFA. A compiler of the at least one per-pattern NFA may determine the subset of nodes.
Each supercluster may further include at least one supercluster character class memory exclusive to the corresponding supercluster. Each at least one supercluster character class memory may be configured to statically store a plurality of regular expression pattern character class definitions.
The supercluster graph memory and the at least one supercluster character class memory may be unified.
The corresponding plurality of HPUs of the corresponding plurality of clusters of the corresponding supercluster may share the at least one supercluster character class memory.
Each supercluster may further include at least one supercluster character class memory. Each at least one supercluster character class memory may be exclusive to a given cluster of the corresponding plurality of clusters of the corresponding supercluster and shared by the corresponding plurality of HPUs of the given cluster. Each at least one supercluster character class memory may be configured to statically store a plurality of regular expression pattern character class definitions.
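One common way to statically store character class definitions (an illustrative assumption, not the layout used by the embodiments herein) is a 256-bit bitmap per class, one bit per byte value, so that membership in a class such as [a-z0-9] is tested with a single indexed bit lookup.

```python
import string

# Hedged sketch: compile a character class (e.g. [a-z0-9]) into a 256-bit
# bitmap, one bit per byte value, so membership is one shift-and-mask test.
def compile_class(chars):
    bitmap = 0
    for c in chars:
        bitmap |= 1 << ord(c)
    return bitmap

def in_class(bitmap, byte_value):
    # True if the bit for this byte value is set in the class bitmap
    return (bitmap >> byte_value) & 1 == 1

lower_alnum = compile_class(string.ascii_lowercase + string.digits)
```

A memory holding many such definitions is then simply an array of bitmaps indexed by a class identifier carried in the NFA node.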
The at least one CPU core may be further configured to select the at least one supercluster of the plurality of superclusters by restricting the supercluster selection based on a graph identifier associated with the at least one HNA instruction.
The graph identifier may be associated with a given per-pattern NFA of a plurality of per-pattern NFAs, and restricting the supercluster selection may include a determination that at least one node of the given per-pattern NFA is stored in a supercluster graph memory exclusive to the at least one supercluster selected.
The HNA scheduler may be configured to select the given HPU from a restricted set of HPUs that may include each respective plurality of HPUs of each respective plurality of clusters of the at least one supercluster selected. The at least one CPU core may be further configured to select the at least one supercluster of the plurality of superclusters based on a determination that at least one node of the given per-pattern NFA associated with the graph identifier is stored in a supercluster graph memory exclusive to the at least one supercluster selected.
The HNA scheduler may be further configured to select the given HPU from the restricted set of HPUs based on a round-robin schedule of the plurality of HPUs in the restricted set of HPUs.
The HNA scheduler may be further configured to select the given HPU from the restricted set of HPUs based on an instantaneous loading of each HPU in the restricted set of HPUs.
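The two selection policies just described, round-robin over the restricted HPU set and selection based on instantaneous loading, can be sketched as follows (the class and method names are illustrative, not taken from the embodiments herein):

```python
from itertools import cycle

# Illustrative scheduler sketch: pick the next HPU either by rotating
# round-robin through the restricted set, or by choosing the HPU with the
# lowest instantaneous load (here approximated by outstanding assignments).
class HnaScheduler:
    def __init__(self, hpus):
        self._round_robin = cycle(hpus)
        self._load = {hpu: 0 for hpu in hpus}

    def pick_round_robin(self):
        return next(self._round_robin)

    def pick_least_loaded(self):
        return min(self._load, key=self._load.get)

    def assign(self, hpu):
        self._load[hpu] += 1   # one more instruction outstanding on this HPU

    def complete(self, hpu):
        self._load[hpu] -= 1   # instruction finished
```

In practice a combination of both policies is possible, for example round-robin among the least-loaded HPUs.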
Each supercluster may further include a supercluster graph memory exclusive to the corresponding supercluster. Each supercluster graph memory may be configured to store at least one node of at least one per-pattern NFA of a plurality of per-pattern NFAs such that the at least one node is replicated in each supercluster graph memory of each supercluster of the at least one HNA processor.
The at least one CPU core may be further configured to provide the HNA scheduler with an option to select the at least one supercluster based on a determination that a given per-pattern NFA of the at least one per-pattern NFA associated with the at least one HNA instruction is replicated. The HNA scheduler may be further configured to select the at least one supercluster based on the option provided and (i) a first round-robin schedule of the plurality of superclusters, (ii) a first instantaneous loading of the plurality of superclusters, or (iii) a combination of (i) and (ii). The HNA scheduler may be further configured to select the given HPU from the plurality of HPUs of the plurality of clusters of the at least one supercluster selected based on a second round-robin schedule of the plurality of HPUs of the plurality of clusters of the at least one supercluster selected, a second instantaneous loading of the plurality of HPUs of the plurality of clusters of the at least one supercluster selected, or a combination thereof.
The at least one HNA processor may further include an HNA on-chip graph memory accessible to the plurality of HPUs of the plurality of clusters of the plurality of superclusters. The HNA on-chip graph memory may be configured to statically store a subset of nodes of at least one per-pattern NFA. A compiler of the at least one per-pattern NFA may determine the subset of nodes.
The at least one HNA instruction may be a first at least one HNA instruction, and the security device may further comprise at least one system memory operatively coupled to the at least one CPU core and the at least one HNA processor. The at least one system memory may be configured to include an HNA off-chip instruction queue for storing a second at least one HNA instruction pending transfer to the HNA on-chip instruction queue of the HNA processor. The at least one system memory may further include an HNA off-chip graph memory configured to statically store a subset of nodes of at least one per-pattern NFA. A compiler of the at least one per-pattern NFA may determine the subset of nodes.
The security device may further comprise at least one local memory controller (LMC). The at least one LMC may be operatively coupled to the at least one HNA processor and the at least one system memory. A given LMC of the at least one LMC may be configured to enable non-coherent access to the at least one system memory for access of the HNA off-chip graph memory by the at least one HNA processor.
The at least one system memory may be further configured to include an HNA packet data memory configured to contiguously store a plurality of payloads, each payload of the plurality of payloads having a fixed maximum length. Each payload of the plurality of payloads may be associated with a given HNA instruction of the first at least one HNA instruction stored in the HNA on-chip instruction queue or the second at least one HNA instruction pending transfer to the HNA on-chip instruction queue.
The at least one system memory may be further configured to include an HNA input stack partition configured to store at least one HNA input stack. Each at least one HNA input stack may be configured to store at least one HNA input job for at least one HPU of the plurality of HPUs of the plurality of clusters of the plurality of superclusters. The at least one system memory may be further configured to include an HNA off-chip run stack partition configured to store at least one HNA off-chip run stack to extend storage of at least one on-chip run stack. Each at least one on-chip run stack may be configured to store at least one run time HNA job for the at least one HPU. The at least one system memory may be further configured to include an HNA off-chip save buffer partition configured to extend storage of at least one on-chip save buffer. Each on-chip save buffer may be configured to store the at least one run time HNA job for the at least one HPU based on detecting a payload boundary. The at least one system memory may be further configured to include an HNA off-chip result buffer partition configured to store at least one final match result entry of the at least one regular expression pattern determined by the at least one HPU to match in the input stream. Each at least one HNA instruction stored may identify a given HNA input stack of the HNA input stack partition, a given HNA off-chip run stack of the HNA off-chip run stack partition, a given HNA off-chip save buffer of the HNA off-chip save buffer partition, and a given HNA off-chip result buffer of the HNA off-chip result buffer partition.
A given LMC of the at least one LMC may be configured to enable the at least one HNA processor to access the HNA packet data memory, the HNA input stack partition, the HNA off-chip instruction queue, the HNA off-chip run stack partition, the HNA off-chip save buffer partition, and the HNA off-chip result buffer partition via a coherent path, and to enable the at least one HNA processor to access the HNA off-chip graph memory via a non-coherent path.
Each HPU of the plurality of HPUs of the plurality of clusters of the plurality of superclusters may include a node cache configured to cache one or more nodes from a supercluster graph memory, an HNA on-chip graph memory, or an HNA off-chip graph memory. Each HPU of the plurality of HPUs of the plurality of clusters of the plurality of superclusters may further include a character class cache configured to cache one or more regular expression pattern character class definitions from a supercluster character class memory, and a payload buffer configured to store a given payload from an HNA packet data memory. The at least one HNA instruction may include an identifier of a location of the given payload in the HNA packet data memory. Each HPU of the plurality of HPUs of the plurality of clusters of the plurality of superclusters may further include a top of stack register configured to store a single HNA job, a run stack configured to store multiple HNA jobs, and a unified memory configured to store first content of a save stack and second content of a match result buffer. The first content may include one or more HNA jobs stored in the run stack and the second content may include one or more final match results. Each HPU of the plurality of HPUs of the plurality of clusters of the plurality of superclusters may further include an HNA processing core operatively coupled to the node cache, the character class cache, the payload buffer, the top of stack register, the run stack, and the unified memory. The HNA processing core may be configured to walk at least one per-pattern NFA with a plurality of payload segments stored in the payload buffer to determine a match of the at least one regular expression pattern.
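The role of the run stack in such a walk can be illustrated with a minimal depth-first sketch (the node encoding and function names are illustrative assumptions, not the format of the embodiments herein): at a split node the walker follows one path and pushes the untaken path onto the run stack as pending work; on a mismatch it pops the stack and resumes from the saved node and payload offset.

```python
# Minimal sketch of a stack-based NFA walk. A node is (kind, arg, next):
# a "char" node consumes one matching payload byte, a "split" node forks
# the walk (the deferred branch is pushed onto the run stack), and a
# "match" node reports the payload offset at which the pattern matched.
def walk(nodes, payload, start=0):
    run_stack = [(start, 0)]              # pending (node id, payload offset) jobs
    while run_stack:
        node_id, offset = run_stack.pop()
        while True:
            kind, arg, nxt = nodes[node_id]
            if kind == "match":
                return offset             # offset just past the matching bytes
            if kind == "split":           # fork: defer one branch, walk the other
                run_stack.append((arg, offset))
                node_id = nxt
                continue
            # "char" node: consume one payload byte if it matches
            if offset < len(payload) and payload[offset] == arg:
                node_id, offset = nxt, offset + 1
            else:
                break                     # dead end: pop the next pending job
    return None

# Hypothetical per-pattern NFA for "ab*c": node 1 splits between the
# "b" self-loop (node 2, deferred) and the "c" exit path (node 3).
nodes = {
    0: ("char", "a", 1),
    1: ("split", 2, 3),
    2: ("char", "b", 1),
    3: ("char", "c", 4),
    4: ("match", None, None),
}
```

The top of stack register and save buffer described above are, in this picture, hardware homes for the hottest pending job and for jobs suspended at a payload boundary, respectively.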
Each supercluster may further include a supercluster graph memory exclusive to the corresponding supercluster. The at least one HNA processor may further include an HNA on-chip graph memory shared by the plurality of superclusters. The security device may further comprise at least one system memory configured to include an HNA off-chip graph memory shared by the plurality of superclusters. The given HPU selected may be configured to walk nodes of a given per-pattern NFA of at least one per-pattern NFA with segments of a payload of the input stream based on the at least one HNA instruction assigned. The nodes walked may be stored in a node cache exclusive to the given HPU selected, the supercluster graph memory, the HNA on-chip graph memory, the HNA off-chip graph memory, or a combination thereof.
The plurality of HPUs of the plurality of clusters of the at least one supercluster selected may form a pool of HPU resources available to the HNA scheduler for selection in order to enable acceleration of the match.
Another example embodiment disclosed herein includes a hyper non-deterministic finite automaton (HNA) processor dedicated to non-deterministic finite automata (NFA) processing. The HNA processor may comprise a plurality of superclusters, each supercluster including a plurality of clusters, and each cluster of the plurality of clusters including a plurality of HNA processing units (HPUs). The HNA processor may further comprise an HNA on-chip instruction queue configured to store at least one HNA instruction. The plurality of HPUs of the plurality of clusters of at least one supercluster selected may form a pool of HPU resources available for assignment of the at least one HNA instruction. The HNA processor may further comprise an HNA scheduler configured to select a given HPU of the pool of resources formed and to assign the at least one HNA instruction to the given HPU selected in order to initiate matching of at least one regular expression pattern in an input stream received from a network.
Another example embodiment disclosed herein includes a method corresponding to operations consistent with the embodiments disclosed herein.
Further, yet another example embodiment may include a non-transitory computer-readable medium having stored thereon a sequence of instructions which, when loaded and executed by a processor, causes the processor to perform methods disclosed herein.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the invention.
FIG. 1A is a block diagram of an example embodiment of an engine architecture for finite automata processing.
FIG. 1B is a block diagram of an example embodiment of a hyper non-deterministic automaton (HNA) processor.
FIG. 1C is a block diagram of an example embodiment of a security device including an example embodiment of an HNA processor.
FIG. 1D is a block diagram of another example embodiment of an HNA processor.
FIG. 1E is a block diagram of an example embodiment of at least one system memory.
FIG. 1F is a flow chart of an example embodiment of a method.
FIG. 1G is a block diagram of an example embodiment of a security device in which embodiments disclosed herein may be implemented.
FIG. 2A through FIG. 2G are example NFA and DFA graphs and tables illustrating the concept of graph explosion.
FIG. 3A is a block diagram of an embodiment of a security device in which embodiments disclosed herein may be implemented.
FIG. 3B is a flow chart of an example embodiment of a method that may be implemented in at least one processor operatively coupled to at least one memory in a security device operatively coupled to a network.
FIG. 4A is a block diagram of an example embodiment of an HNA processing unit (HPU).
FIG. 4B is a block diagram of an example embodiment of a context that may be stored or retrieved according to embodiments disclosed herein.
FIG. 5A is a block diagram of an example embodiment of a per-pattern non-deterministic finite automaton (NFA) graph that a walker may use to match a regular expression pattern in an input stream.
FIG. 5B is a table of an example embodiment of processing cycles for walking the per-pattern NFA graph of FIG. 5A with a payload.
FIG. 6 is a block diagram of an example embodiment of an environment for a walker.
FIG. 7A is a block diagram of an example embodiment of an environment for a compiler.
FIG. 7B is a block diagram of an example embodiment of an HNA processing core operatively coupled to a plurality of memories mapped to a plurality of levels of a memory hierarchy.
FIG. 8 is a block diagram of an example embodiment of a node distribution of a plurality of per-pattern NFAs.
FIG. 9 is a flow chart of an example embodiment of a method that may be implemented in at least one processor operatively coupled to memories mapped to a plurality of levels of a memory hierarchy in a security device operatively coupled to a network.
FIG. 10 is a block diagram of an example embodiment of another node distribution of nodes of a plurality of per-pattern NFAs.
FIG. 11 is a flow chart of an example embodiment of a method for distributing nodes of at least one per-pattern NFA.
FIG. 12 is a flow chart of another example embodiment of a method that may be implemented in at least one processor operatively coupled to memories mapped to a plurality of levels of a memory hierarchy in a security device operatively coupled to a network.
FIG. 13A is a flow chart 1300 of an example embodiment of a method that may be implemented in at least one processor operatively coupled to a plurality of memories in a memory hierarchy and a node cache in a security device operatively coupled to a network.
FIG. 13B is a block diagram of an example embodiment of a payload and segments of the payload with corresponding offsets.
FIG. 13C is a table of an example embodiment of processing cycles for walking the per-pattern NFA graph of FIG. 5A with the payload of FIG. 13B by selecting a lazy path at a split node.
FIG. 13D is a table that is a continuation of the table of FIG. 13C.
FIG. 14 is a block diagram of the example internal structure of a computer in which embodiments disclosed herein may optionally be implemented.
DETAILED DESCRIPTION
According to embodiments disclosed herein, an engine architecture for finite automata processing may include a hyper non-deterministic automaton (HNA) processor that provides hardware acceleration for non-deterministic finite automata (NFA) processing. The HNA processor may be a co-processor complementary to a hyper finite automaton (HFA) co-processor that may provide hardware acceleration for deterministic finite automata (DFA) processing. The HNA and HFA may be regular expression processors that may be employed for deep packet inspection applications, such as intrusion detection/prevention (IDP), packet classification, server load balancing, web switching, storage area network (SAN), firewall load balancing, virus scanning, or any other suitable deep packet inspection application. The HNA and HFA may offload general purpose central processing units (CPUs) from the heavy burden of performing the compute- and memory-intensive pattern matching process.
FIG. 1A is a block diagram 150 of an example embodiment of an engine architecture for finite automata processing. According to the example embodiment, at least one CPU core 103 may be operatively coupled to at least one HFA processor 110 and at least one HNA processor 108. The operative coupling may include coupling via a bus, an interrupt, a mailbox, one or more circuit elements, a communications path, a communicative coupling, or coupling in any other suitable manner. The at least one HFA processor 110 may be dedicated to DFA processing, and the at least one HNA processor 108 may be dedicated to NFA processing. The at least one CPU core 103, the at least one HFA processor 110, and the at least one HNA processor 108 may be configured to share a Level-2 cache (L2C) 113.
The at least one CPU core 103, the at least one HNA processor 108, and the at least one HFA processor 110 may each be operatively coupled to the L2C 113 via coherent paths 115a, 115b, and 115c, respectively, which may be separate coherent memory buses, a single shared coherent memory bus, separate coherent communications channels, a shared coherent communications channel, or any other suitable coherent path. An L2C memory controller (not shown) may employ the L2C 113 to maintain memory reference coherency for memory accesses via the coherent paths 115a, 115b, and 115c. For example, if the at least one HNA processor 108 accesses a given memory location via the coherent path 115b, memory reference coherency may be maintained by invalidating a data copy of contents read from the given memory location by the at least one CPU core 103. Invalidating the data copy may enable the at least one CPU core 103 or the at least one HFA processor 110 to obtain a value of the given memory location as most recently updated by the at least one HNA processor 108.
该示例实施例可以进一步包括至少一个本地存储器控制器(LMC)117,该本地存储器控制器可以操作性地耦合至L2C 113并且被配置成用于管理多种访问,如对至少一个系统存储器151或从其读取、写入、加载、存储或任何其他合适的访问。如此,至少一个CPU内核103、至少一个HNA处理器108、或至少一个HFA处理器110通过一致路径115a、115b、和115c对至少一个系统存储器151中的一个位置的访问使至少一个CPU内核103、至少一个HFA处理器110、以及至少一个HNA处理器108能够为所访问的位置保持一个共同值。The example embodiment may further include at least one local memory controller (LMC) 117, which may be operatively coupled to the L2C 113 and configured to manage various accesses, such as reading, writing, loading, storing, or any other suitable access to or from the at least one system memory 151. Thus, access by the at least one CPU core 103, the at least one HNA processor 108, or the at least one HFA processor 110 to a location in the at least one system memory 151 via the coherent paths 115a, 115b, and 115c enables the at least one CPU core 103, the at least one HFA processor 110, and the at least one HNA processor 108 to maintain a common value for the accessed location.
进一步地，如以下参照图1B和图4A所描述的，至少一个HNA处理器108可以包括多个HNA处理单元(HPU)，每个处理单元包括至少一个HNA处理内核。如此，通过一致路径115a或115b的访问可以使至少一个CPU内核103中的每个至少一个CPU内核以及至少一个HNA处理器108的HPU中的每个的每个至少一个HNA处理内核能够保持存储器参考一致性。这些HPU可以是能够使至少一个HNA处理器108的总体性能实现至少20Gbps的并发HPU。Furthermore, as described below with reference to FIG. 1B and FIG. 4A, the at least one HNA processor 108 may include multiple HNA processing units (HPUs), each of which includes at least one HNA processing core. Thus, access via the coherent paths 115a or 115b enables each of the at least one CPU core in the at least one CPU core 103 and each of the at least one HNA processing core in each of the HPUs of the at least one HNA processor 108 to maintain memory reference coherence. These HPUs may be concurrent HPUs capable of achieving an overall performance of at least 20 Gbps for the at least one HNA processor 108.
回到图1A，至少一个HFA处理器110和至少一个HNA处理器108可以分别通过非一致路径119a和119b操作性地耦合至LMC 117，从而使得至少一个HFA处理器110和至少一个HNA处理器108能够绕过L2C 113以减少存储器访问延迟，从而提高匹配性能。根据在此披露的实施例，非一致路径119a可以使HNA处理器108能够通过LMC 117直接访问至少一个系统存储器151，从而基于HNA处理器108访问的至少一个系统存储器151的具体分区或位置而绕过L2C 113。Returning to FIG. 1A, at least one HFA processor 110 and at least one HNA processor 108 may be operatively coupled to LMC 117 via non-coherent paths 119a and 119b, respectively, thereby enabling at least one HFA processor 110 and at least one HNA processor 108 to bypass L2C 113 to reduce memory access latency and thereby improve matching performance. According to embodiments disclosed herein, non-coherent path 119a may enable HNA processor 108 to directly access at least one system memory 151 via LMC 117, thereby bypassing L2C 113 based on the specific partition or location of at least one system memory 151 accessed by HNA processor 108.
例如,从HNA处理器108的角度看,如果至少一个系统存储器151的具体分区或位置包括只读内容,则可以使用非一致路径119a,因为基于访问来保持一致性将不是问题。此类只读内容可以包括图形存储器内容,如HNA处理器108可以用来对输入流中的正则表达式进行匹配的至少一个NFA图形(未示出)的一个或多个节点。通过非一致路径119a访问至少一个系统存储器151来绕过L2C 113可以通过避免将以另外方式发生的延迟来提高至少一个HNA处理器108的匹配性能,以便保持访问的一致性。进一步地,由于只读内容可以有利地包括至少一个NFA图形的可以不具有时间或空间局部性的一个或多个节点,如以下参照图13D所披露的,通过非一致路径119a访问该一个或多个节点可以实现另一个优点,因为此类访问将不污染L2C 113的现有内容。For example, from the perspective of HNA processor 108, if a specific partition or location of at least one system memory 151 includes read-only content, non-coherent path 119a can be used because maintaining access-based consistency is not an issue. Such read-only content can include graph memory content, such as one or more nodes of at least one NFA graph (not shown) that HNA processor 108 can use to match regular expressions in an input stream. Accessing at least one system memory 151 via non-coherent path 119a to bypass L2C 113 can improve matching performance of at least one HNA processor 108 by avoiding delays that would otherwise be incurred to maintain access consistency. Furthermore, because read-only content can advantageously include one or more nodes of at least one NFA graph that may not have temporal or spatial locality, as disclosed below with reference to FIG. 13D , accessing such one or more nodes via non-coherent path 119a can achieve another advantage because such access will not pollute existing contents of L2C 113.
图1B为HNA处理器108的示例实施例的框图155。如以上参照图1A所披露的，HNA处理器108可以专用于NFA处理。HNA处理器108可以包括多个超级集群，如超级集群121a和121b。每个超级集群可以包括多个集群，如超级集群121a的集群123a和123b和超级集群121b的集群123c和123d。该多个集群123a-d中的每个集群可以包括多个HNA处理单元(HPU)，如集群123a的HPU 125a和125b、集群123b的HPU 125c和125d、集群123c的HPU 125e和125f、以及集群123d的HPU 125g和125h。HPU 125a-h中的每个可以具有一个如以下参照图4A所披露的架构。HNA处理器108可以进一步包括一个可以被配置成用于存储至少一个HNA指令153的HNA片上指令队列154，该指令可以被分配至HPU 125a-h中的给定HPU。FIG. 1B is a block diagram 155 of an example embodiment of the HNA processor 108. As disclosed above with reference to FIG. 1A, the HNA processor 108 can be dedicated to NFA processing. The HNA processor 108 can include multiple superclusters, such as superclusters 121a and 121b. Each supercluster can include multiple clusters, such as clusters 123a and 123b of supercluster 121a and clusters 123c and 123d of supercluster 121b. Each of the multiple clusters 123a-d can include multiple HNA processing units (HPUs), such as HPUs 125a and 125b of cluster 123a, HPUs 125c and 125d of cluster 123b, HPUs 125e and 125f of cluster 123c, and HPUs 125g and 125h of cluster 123d. Each of the HPUs 125a-h can have an architecture as disclosed below with reference to FIG. 4A. The HNA processor 108 may further include an HNA on-chip instruction queue 154 that may be configured to store at least one HNA instruction 153 that may be dispatched to a given one of the HPUs 125a-h.
对至少一个HNA指令153进行分配可以包括写入特定门铃寄存器，该门铃寄存器被配置成用于使用与以下参照图4A所披露的HNA指令相关联的信息来触发该多个HPU中的给定HPU以开始图形行走。分配可以包括触发与该给定HPU相关联的中断、或以任何其他合适方式的分配。Dispatching at least one HNA instruction 153 may include writing to a specific doorbell register configured to trigger a given HPU in the plurality of HPUs to begin a graph walk using information associated with an HNA instruction disclosed below with reference to FIG. 4A. Dispatching may include triggering an interrupt associated with the given HPU, or dispatching in any other suitable manner.
可以按照HNA指令块的链表来保持HNA片上指令队列154或以任何其他合适的方式来保持。每个HNA指令块可以包括可编程数量的固定长度的HNA指令。软件可以对可以由HPU释放的指令块进行分配。HNA指令块后面可以紧跟着一个可以包括下一个块指针链接的64位字。给定HPU可以被配置成用于与该给定HPU相关联的门铃计数一指示指令块中最后一个HNA指令包含一个有效HNA指令就读取下一个块指针。该给定HPU可以从该指令队列读取字，例如，以头指针开始，并且基于处理指令块中的最后一个指令，该给定HPU可以使用下一个块指针链接来遍历至下一个指令块。以此方式，当止用块中的最后一个HNA指令时，该给定HPU可以自动地将所止用的存储器块释放回管理池。下一个块指针可以是指令块内的最后一个HNA指令后的下一个64位字。下一个块指针可以指定一个前向链接，以使该给定HPU能够对可以存储在至少一个系统存储器151内的下一个指令块进行定位。The HNA on-chip instruction queue 154 can be maintained as a linked list of HNA instruction blocks or in any other suitable manner. Each HNA instruction block can include a programmable number of fixed-length HNA instructions. Software can allocate instruction blocks that can be released by the HPU. An HNA instruction block can be followed by a 64-bit word that can include a next block pointer link. A given HPU can be configured to read the next block pointer once a doorbell count associated with the given HPU indicates that the last HNA instruction in an instruction block contains a valid HNA instruction. The given HPU can read words from the instruction queue, for example, starting with the head pointer, and based on processing the last instruction in the instruction block, the given HPU can use the next block pointer link to traverse to the next instruction block. In this way, when the last HNA instruction in a block is retired, the given HPU can automatically release the retired memory block back to the managed pool. The next block pointer can be the next 64-bit word following the last HNA instruction in the instruction block. The next block pointer can specify a forward link to enable the given HPU to locate the next instruction block, which may be stored within the at least one system memory 151.
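The linked-list traversal described above can be sketched as a small Python model. The names (`InstructionBlock`, `walk_queue`, `BLOCK_SIZE`) are hypothetical; in hardware the next-block pointer is a 64-bit word trailing the block, modeled here as an object reference.

```python
BLOCK_SIZE = 4  # programmable number of fixed-length HNA instructions per block


class InstructionBlock:
    """One software-allocated block of HNA instructions plus its forward link."""

    def __init__(self, instructions, next_block=None):
        assert len(instructions) <= BLOCK_SIZE
        self.instructions = instructions
        self.next_block = next_block  # models the trailing next-block pointer word


def walk_queue(head):
    """Traverse the queue the way a given HPU would: consume each block's
    instructions, then follow the forward link and free the retired block."""
    consumed, freed = [], []
    block = head
    while block is not None:
        consumed.extend(block.instructions)
        nxt = block.next_block
        freed.append(block)  # retired block is released back to the managed pool
        block = nxt
    return consumed, freed


b2 = InstructionBlock(["i4", "i5"])
b1 = InstructionBlock(["i0", "i1", "i2", "i3"], next_block=b2)
instrs, freed = walk_queue(b1)
assert instrs == ["i0", "i1", "i2", "i3", "i4", "i5"]
assert freed == [b1, b2]
```

The forward chaining lets the HPU process an unbounded instruction stream while blocks themselves stay fixed-size and recyclable.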
为了将HNA指令插入到HNA片上指令队列154内，软件可以将该HNA指令写入到软件所保持的尾指针，然后将有待添加至HNA片上指令队列154的HNA指令的总数写入至给定的HNA门铃计数寄存器。到该给定的HNA门铃计数寄存器的写入可以是累积的并且可以反映未决HNA指令的总数。当该给定HPU止用指令时，可以使相应的HNA门铃计数寄存器自动减量。该给定HPU可以被配置成用于继续处理HNA指令，直到已经为所有未决请求提供了服务，例如，直到相关联的累积门铃计数寄存器为零。To insert an HNA instruction into the HNA on-chip instruction queue 154, software may write the HNA instruction to a tail pointer maintained by the software and then write the total number of HNA instructions to be added to the HNA on-chip instruction queue 154 to a given HNA doorbell count register. Writes to the given HNA doorbell count register may be cumulative and reflect the total number of pending HNA instructions. When the given HPU retires an instruction, the corresponding HNA doorbell count register may be automatically decremented. The given HPU may be configured to continue processing HNA instructions until all pending requests have been serviced, e.g., until the associated cumulative doorbell count register reaches zero.
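The doorbell semantics above (cumulative software writes, automatic decrement on retirement) can be modeled as follows; the `DoorbellCount` class and its method names are illustrative assumptions, not register names from the disclosure.

```python
class DoorbellCount:
    """Toy model of a cumulative HNA doorbell count register."""

    def __init__(self):
        self.count = 0

    def ring(self, n):
        # Software writes the number of instructions just queued;
        # writes are cumulative, so count reflects total pending instructions.
        self.count += n

    def retire(self):
        # Hardware decrements the register as the HPU retires an instruction.
        assert self.count > 0
        self.count -= 1

    def has_pending(self):
        return self.count > 0


db = DoorbellCount()
db.ring(3)  # software queues 3 instructions
db.ring(2)  # later queues 2 more; register now reflects 5 pending
serviced = 0
while db.has_pending():
    db.retire()
    serviced += 1
assert serviced == 5 and db.count == 0
```

The HPU simply services work until the register reads zero, so software never needs to know how far the HPU has progressed between writes.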
根据在此披露的实施例，该多个超级集群121a和121b中的所选择的至少一个超级集群121a的该多个集群123a和123b的该多个HPU 125a-d可以形成一个可用于对至少一个HNA指令153进行分配的HPU资源池127。HNA处理器108可以进一步包括一个可以被配置成用于从所形成的HPU资源池127选择一个给定HPU(如HPU 125b)的HNA调度器129，并且HNA调度器129可以将至少一个HNA指令153分配给所选择的给定HPU 125b以便发起对从网络(未示出)接收到的输入流(未示出)中的至少一个正则表达式模式(未示出)进行匹配。形成HNA调度器129可用于选择的HPU资源池127的多个HPU 125a-d能够实现匹配的加速。According to embodiments disclosed herein, the plurality of HPUs 125a-d of the plurality of clusters 123a and 123b of at least one selected supercluster 121a from the plurality of superclusters 121a and 121b may form an HPU resource pool 127 that may be used to allocate at least one HNA instruction 153. The HNA processor 108 may further include an HNA scheduler 129 that may be configured to select a given HPU (e.g., HPU 125b) from the formed HPU resource pool 127, and the HNA scheduler 129 may allocate the at least one HNA instruction 153 to the selected given HPU 125b to initiate matching of at least one regular expression pattern (not shown) in an input stream (not shown) received from a network (not shown). The plurality of HPUs 125a-d forming the HPU resource pool 127 from which the HNA scheduler 129 may select can enable accelerated matching.
应理解到,在此被称为“片上”的HNA组件是指可以集成在HNA处理器108的单个芯片基板上的组件,并且针对超级集群、集群、或HPU示出的总数是出于说明目的,并且可以使用任何合适的总数。例如,该多个超级集群的一个第一总数可以至少为二,该多个集群的一个第二总数可以至少为二,并且该多个HPU的一个第三总数可以至少为十。It should be understood that HNA components referred to herein as "on-chip" refer to components that can be integrated on a single chip substrate of the HNA processor 108, and that the total numbers shown for superclusters, clusters, or HPUs are for illustrative purposes, and any suitable total number may be used. For example, a first total number of the plurality of superclusters may be at least two, a second total number of the plurality of clusters may be at least two, and a third total number of the plurality of HPUs may be at least ten.
图1C为包括HNA处理器108的示例实施例的安全装置102的实施例的框图157。安全装置102可以操作性地耦合至网络(未示出)。该网络可以是广域网(WAN)、局域网(LAN)、无线网络、或任何其他合适的网络。安全装置102可以包括至少一个CPU内核103和可以操作性地连至以上参照图1A所披露的至少一个CPU内核103的至少一个HNA处理器108。至少一个HNA处理器108可以专用于非确定有限自动机(NFA)处理。FIG1C is a block diagram 157 of an embodiment of a security device 102 including an example embodiment of an HNA processor 108. The security device 102 can be operatively coupled to a network (not shown). The network can be a wide area network (WAN), a local area network (LAN), a wireless network, or any other suitable network. The security device 102 can include at least one CPU core 103 and at least one HNA processor 108 operatively coupled to the at least one CPU core 103 disclosed above with reference to FIG1A . The at least one HNA processor 108 can be dedicated to nondeterministic finite automaton (NFA) processing.
根据在此披露的实施例,至少一个HNA处理器108可以包括多个超级集群,如以上披露的超级集群121a和121b。每个超级集群可以包括多个集群,如超级集群121a的集群123a和123b和超级集群121b的集群123c和123d。该多个集群123a-d中的每个集群可以包括多个HNA处理单元(HPU),如集群123a的HPU 125a和125b、集群123b的HPU 125c和125d、集群123c的HPU 125e和125f、以及集群123d的HPU 125g和125h。至少一个CPU内核103可以被配置成用于当向HNA处理器108提交指令时选择该多个超级集群121a和121b中的至少一个超级集群,如超级集群121a。According to embodiments disclosed herein, at least one HNA processor 108 may include multiple superclusters, such as superclusters 121a and 121b disclosed above. Each supercluster may include multiple clusters, such as clusters 123a and 123b of supercluster 121a and clusters 123c and 123d of supercluster 121b. Each of the multiple clusters 123a-d may include multiple HNA processing units (HPUs), such as HPUs 125a and 125b of cluster 123a, HPUs 125c and 125d of cluster 123b, HPUs 125e and 125f of cluster 123c, and HPUs 125g and 125h of cluster 123d. At least one CPU core 103 may be configured to select at least one supercluster from the multiple superclusters 121a and 121b, such as supercluster 121a, when submitting an instruction to the HNA processor 108.
至少一个HNA处理器108可以包括一个被配置成用于存储至少一个HNA指令153的HNA片上指令队列154。至少一个HNA处理器108可以包括HNA调度器129。HNA调度器129可以被配置成用于选择所选择的至少一个超级集群121a的该多个集群123a和123b中的该多个HPU 125a-d中的给定HPU 125b并且将至少一个HNA指令153分配给给定HPU 125b以便发起对从网络(未示出)接收到的输入流(未示出)中的至少一个正则表达式模式(未示出)进行匹配。The at least one HNA processor 108 may include an HNA on-chip instruction queue 154 configured to store at least one HNA instruction 153. The at least one HNA processor 108 may include an HNA scheduler 129. The HNA scheduler 129 may be configured to select a given HPU 125b from the plurality of HPUs 125a-d in the plurality of clusters 123a and 123b of the selected at least one supercluster 121a and assign the at least one HNA instruction 153 to the given HPU 125b to initiate matching of at least one regular expression pattern (not shown) in an input stream (not shown) received from a network (not shown).
图1D为HNA处理器108的另一个示例实施例的框图158。根据该示例实施例,每个超级集群可以进一步包括一个相应超级集群专有的一个超级集群图形存储器156a。例如,超级集群图形存储器156a可以是对相应超级集群121a专有的。超级集群图形存储器156a对相应超级集群的相应多个集群的相应多个HPU(如集群123a和123b的该多个HPU 125a-d)是可访问的并且可以被配置成用于静态地存储至少一个每模式NFA(未示出)(如以下参照图3A所披露的每模式NFA 314)的一个节点子集(未示出)。该节点子集可以由一个编译器(未示出)确定,如以下参照图3A所披露的编译器306,该编译器可以确定节点分布,如以下参照图7A和图8所披露的节点分布。超级集群图形存储器156a可以被配置成用于存储多种类型的NFA节点。不同节点类型的NFA节点可以配置有给定大小,从而使多种节点类型的多个节点能够具有相同的节点大小。FIG1D is a block diagram 158 of another example embodiment of an HNA processor 108. According to this example embodiment, each supercluster may further include a supercluster graph memory 156a specific to the corresponding supercluster. For example, supercluster graph memory 156a may be specific to supercluster 121a. Supercluster graph memory 156a is accessible to the corresponding HPUs of the corresponding clusters of the corresponding supercluster (e.g., the HPUs 125a-d of clusters 123a and 123b) and may be configured to statically store a subset of nodes (not shown) of at least one per-pattern NFA (not shown) (such as per-pattern NFA 314 described below with reference to FIG3A). The subset of nodes may be determined by a compiler (not shown), such as compiler 306 described below with reference to FIG3A, which may determine a node distribution, such as the node distribution described below with reference to FIG7A and FIG8. Supercluster graph memory 156a may be configured to store multiple types of NFA nodes. NFA nodes of different node types may be configured with a given size, thereby enabling multiple nodes of multiple node types to have the same node size.
根据在此披露的实施例,每个超级集群可以进一步包括相应超级集群专有的至少一个超级集群字符类存储器135。例如,至少一个超级集群字符类存储器135可以是对相应超级集群121a专有的。每个至少一个超级集群字符类存储器可以被配置成用于静态地存储多个正则表达式模式字符类定义(未示出)。所存储的正则表达式模式字符类定义可以用于对输入流中的至少一个正则表达式模式进行匹配。相应超级集群121a的相应多个集群123a和123b的相应多个HPU 125a-d可以共享至少一个超级集群字符类存储器135。根据另一个实施例,超级集群图形存储器156a和至少一个超级集群字符类存储器135可以是统一的。According to embodiments disclosed herein, each super cluster may further include at least one super cluster character class memory 135 that is specific to the corresponding super cluster. For example, the at least one super cluster character class memory 135 may be specific to the corresponding super cluster 121a. Each at least one super cluster character class memory may be configured to statically store a plurality of regular expression pattern character class definitions (not shown). The stored regular expression pattern character class definitions may be used to match at least one regular expression pattern in an input stream. The at least one super cluster character class memory 135 may be shared by the corresponding plurality of HPUs 125a-d of the corresponding plurality of clusters 123a and 123b of the corresponding super cluster 121a. According to another embodiment, the super cluster graph memory 156a and the at least one super cluster character class memory 135 may be unified.
根据在此披露的实施例,每个至少一个HNA指令153可以指定一个用于指定哪个每模式NFA用于对至少一个正则表达式进行匹配的图形标识符。根据一个实施例,编译器(如图3A的编译器306)可以对每个每模式NFA的节点进行分布,从而通过将给定每模式NFA的节点存储到给定超级集群专有的存储器使得该给定每模式NFA对该给定超级集群是专有的。According to embodiments disclosed herein, each at least one HNA instruction 153 may specify a graph identifier for specifying which per-pattern NFA to use for matching at least one regular expression. According to one embodiment, a compiler (such as compiler 306 of FIG. 3A ) may distribute the nodes of each per-pattern NFA so that the given per-pattern NFA is specific to the given supercluster by storing the nodes of the given per-pattern NFA in a memory specific to the given supercluster.
如此,可以基于与至少一个HNA指令153所指定的给定每模式NFA相关联的唯一图形标识符对至少一个HNA指令153进行分配,以遍历(即,行走)有效载荷段以便对至少一个正则表达式模式进行匹配。如此,可以将HPU选择限制到给定超级集群的HPU。在该给定超级集群内,由于该给定超级集群的集群访问共享统一超级集群图形存储器,可以基于该给定集群的每个HPU的循环调度、瞬时加载、以上的组合、或以任何其他合适的方式选择给定该超级集群的给定HPU。In this manner, at least one HNA instruction 153 can be assigned based on a unique graph identifier associated with a given per-pattern NFA specified by the at least one HNA instruction 153 to traverse (i.e., walk) payload segments to match at least one regular expression pattern. In this manner, HPU selection can be limited to HPUs of a given supercluster. Within the given supercluster, because the clusters of the given supercluster access a shared unified supercluster graph memory, a given HPU of the given supercluster can be selected based on round-robin scheduling of each HPU of the given cluster, transient loading, a combination of the above, or in any other suitable manner.
例如,该图形标识符可以与多个每模式NFA中的给定每模式NFA相关联,如图3A的NFA 314。如此,一个给定的模式集合可以共享同一个图形标识符。例如,图3A的规则集310中的所有模式可以共享同一个图形标识符。在某些情况下,可以存在与系统中的规则集310相似的多个规则集。在那种情况下,每个单独的“规则集”可以具有一个唯一的图形标识符。该图形标识符可以与该给定每模式NFA的至少一个节点(未示出)相关联并且可以存储在可以是对该多个超级集群121a和121b的给定超级集群121a专有的超级集群图形存储器156a内,如超级集群121a专有的超级集群图形存储器156a。该图形标识符可以与一个模式集合相关联。至少一个CPU内核103可以基于确定与该图形标识符相关联的给定每模式NFA 314的至少一个节点存储在给定超级集群121a专有的超级集群图形存储器156a内来选择该多个超级集群121a和121b中的给定超级集群121a。For example, the graph identifier can be associated with a given per-pattern NFA in a plurality of per-pattern NFAs, such as NFA 314 of FIG. 3A . Thus, a given set of patterns can share the same graph identifier. For example, all patterns in rule set 310 of FIG. 3A can share the same graph identifier. In some cases, there can be multiple rule sets similar to rule set 310 in the system. In that case, each separate “rule set” can have a unique graph identifier. The graph identifier can be associated with at least one node (not shown) of the given per-pattern NFA and can be stored in a supercluster graph memory 156a that can be specific to a given supercluster 121a of the plurality of superclusters 121a and 121b, such as supercluster graph memory 156a specific to supercluster 121a. The graph identifier can be associated with a set of patterns. At least one CPU core 103 may select a given super cluster 121a of the plurality of super clusters 121a and 121b based on determining that at least one node of a given per-pattern NFA 314 associated with the graph identifier is stored within a super cluster graph memory 156a specific to the given super cluster 121a.
如此,至少一个CPU内核103可以被进一步配置成用于通过基于与至少一个HNA指令153相关联的图形标识符(未示出)限制超级集群选择来选择多个超级集群121a和121b中的该至少一个超级集群(如超级集群121a)。例如,该图形标识符可以与多个每模式NFA中的给定每模式NFA相关联,并且限制超级集群选择可以包括确定该给定每模式NFA的至少一个节点存储在至少一个超级集群121a专有的超级集群图形存储器156a内。至少一个CPU内核103可以基于确定与该图形标识符相关联的该给定每模式NFA的至少一个节点存储在超级集群121a专有的超级集群图形存储器156a内来选择多个超级集群121a和121b中的至少一个超级集群121a。Thus, at least one CPU core 103 may be further configured to select the at least one super cluster (such as super cluster 121a) from the plurality of super clusters 121a and 121b by limiting the super cluster selection based on a graph identifier (not shown) associated with at least one HNA instruction 153. For example, the graph identifier may be associated with a given per-mode NFA from the plurality of per-mode NFAs, and limiting the super cluster selection may include determining that at least one node of the given per-mode NFA is stored within a super cluster graph memory 156a that is dedicated to the at least one super cluster 121a. The at least one CPU core 103 may select at least one super cluster 121a from the plurality of super clusters 121a and 121b based on determining that at least one node of the given per-mode NFA associated with the graph identifier is stored within a super cluster graph memory 156a that is dedicated to the super cluster 121a.
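The restriction described above — limiting supercluster selection to the one whose private graph memory holds the per-pattern NFA named by the instruction's graph identifier — can be sketched as a lookup. The mapping, dictionary keys, and function name below are hypothetical; in practice the compiler establishes the association when it distributes each per-pattern NFA's nodes.

```python
# Hypothetical table built by the compiler when it stores each per-pattern
# NFA's nodes into exactly one supercluster's private graph memory.
graph_to_supercluster = {
    "graph_0": "supercluster_121a",
    "graph_1": "supercluster_121b",
}


def select_supercluster(hna_instruction):
    # Restrict supercluster selection based on the graph identifier
    # associated with the HNA instruction.
    return graph_to_supercluster[hna_instruction["graph_id"]]


instr = {"graph_id": "graph_1", "payload": b"example input"}
assert select_supercluster(instr) == "supercluster_121b"
```

Because every pattern in a given rule set shares one graph identifier, a single lookup suffices to route all instructions for that rule set to the correct supercluster.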
根据在此披露的实施例，HNA调度器129可以被配置成用于从可以包括所选择的该至少一个超级集群的每个相应多个集群的每个相应多个HPU的HPU限制集合(如所选择的相应超级集群121a的相应多个集群123a和123b的相应多个HPU 125a-d)中选择一个给定HPU，如图1C的HPU 125b。HNA调度器129可以被配置成用于基于该HPU限制集合中的HPU 125a-d的循环调度、该HPU限制集合中的HPU 125a-d中的每个HPU的瞬时加载、以上的组合、或基于任何其他合适的调度策略从可以包括HPU 125a-d的该HPU限制集合中选择给定HPU 125b。According to embodiments disclosed herein, the HNA scheduler 129 may be configured to select a given HPU, such as HPU 125b of FIG. 1C, from a restricted set of HPUs that may include each respective plurality of HPUs of each respective plurality of clusters of the selected at least one supercluster (e.g., the respective plurality of HPUs 125a-d of the respective plurality of clusters 123a and 123b of the selected respective supercluster 121a). The HNA scheduler 129 may be configured to select the given HPU 125b from the restricted set of HPUs that may include the HPUs 125a-d based on round-robin scheduling of the HPUs 125a-d in the restricted set, instantaneous loading of each of the HPUs 125a-d in the restricted set, a combination thereof, or based on any other suitable scheduling strategy.
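The two selection policies named above (round-robin over the restricted set, or picking the least-loaded HPU by instantaneous load) can be illustrated with a toy scheduler. The `HNAScheduler` class, its policy names, and the load bookkeeping are illustrative assumptions, not the hardware's actual mechanism.

```python
import itertools


class HNAScheduler:
    """Toy scheduler choosing a given HPU from the restricted set of HPUs
    belonging to a selected supercluster."""

    def __init__(self, hpus):
        self.hpus = hpus                   # restricted set, e.g. HPUs 125a-d
        self._rr = itertools.cycle(hpus)   # round-robin iterator over the set
        self.load = {h: 0 for h in hpus}   # per-HPU instantaneous load

    def dispatch(self, policy="least_loaded"):
        if policy == "round_robin":
            hpu = next(self._rr)
        else:
            # Pick the HPU with the smallest instantaneous load.
            hpu = min(self.hpus, key=lambda h: self.load[h])
        self.load[hpu] += 1  # instruction dispatched; load rises
        return hpu


sched = HNAScheduler(["125a", "125b", "125c", "125d"])
order = [sched.dispatch("round_robin") for _ in range(4)]
assert order == ["125a", "125b", "125c", "125d"]
```

A combined policy, also contemplated above, could fall back to round-robin only when instantaneous loads tie.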
根据在此披露的另一个实施例，编译器(如以下披露的图3A的编译器306)可以复制多个超级集群图形存储器中的该至少一个每模式NFA中的给定每模式NFA的一个或多个节点，每个超级集群图形存储器对相应超级集群是专有的。如此，至少一个HNA指令153可以被分配或调度至HNA处理器108的任何超级集群的任何集群的任何HPU。可以基于具体超级集群(或超级集群内的集群)处的瞬时加载或基于HNA处理器108的该多个超级集群的该多个集群的该多个HPU的循环调度来选择HPU。如果希望每个每模式NFA的最大吞吐量，则可以是此类情况。然而，根据替代性示例实施例，当HNA处理器108的多个超级集群的每个超级集群图形存储器包含所复制的每模式NFA节点时，此类配置可以通过HNA处理器108的超级集群的总数来限制所支持的每模式NFA的总数。According to another embodiment disclosed herein, a compiler (such as compiler 306 of FIG. 3A, disclosed below) can replicate one or more nodes of a given per-pattern NFA in at least one per-pattern NFA in multiple supercluster graph memories, each supercluster graph memory being dedicated to a corresponding supercluster. In this manner, at least one HNA instruction 153 can be assigned or scheduled to any HPU in any cluster of any supercluster of the HNA processor 108. The HPU can be selected based on instantaneous load at a particular supercluster (or a cluster within a supercluster) or based on round-robin scheduling of the HPUs in the multiple clusters of the multiple superclusters of the HNA processor 108. This may be the case if maximum throughput for each per-pattern NFA is desired. However, according to an alternative example embodiment, when each supercluster graph memory of the multiple superclusters of the HNA processor 108 contains replicated per-pattern NFA nodes, such a configuration can limit the total number of per-pattern NFAs supported to the total number of superclusters of the HNA processor 108.
例如,根据替代性示例实施例,编译器306可以将每个超级集群图形存储器配置成用于存储多个每模式NFA中的至少一个每模式NFA的至少一个节点以复制该至少一个HNA处理器的每个超级集群的每个超级集群图形存储器中的该至少一个节点。如此,至少一个CPU内核103可以为HNA调度器129提供一个选项:基于确定复制与该至少一个HNA指令相关联的至少一个每模式NFA中的给定每模式NFA来选择该至少一个超级集群。For example, according to an alternative exemplary embodiment, the compiler 306 may configure each super-cluster graph memory to store at least one node of at least one per-mode NFA in the plurality of per-mode NFAs to replicate the at least one node in each super-cluster graph memory of each super-cluster of the at least one HNA processor. In this manner, the at least one CPU core 103 may provide the HNA scheduler 129 with an option to select the at least one super-cluster based on determining to replicate a given per-mode NFA in the at least one per-mode NFA associated with the at least one HNA instruction.
如此，作为至少一个CPU内核103选择至少一个超级集群的替代方案，HNA调度器129可以基于所提供的该选项来选择至少一个超级集群，如图1C的超级集群121a。例如，如果所提供的该选项指示HNA调度器129要选择至少一个超级集群，则HNA调度器129可以基于所提供的该选项和(i)该多个超级集群的一次第一循环调度、(ii)该多个超级集群的一次第一瞬时加载或(iii)(i)与(ii)的组合来选择该至少一个超级集群。然后，HNA调度器129可以基于HNA调度器129所选择的至少一个超级集群121a的多个集群123a和123b的多个HPU 125a-d的一次第二循环调度、HNA调度器129所选择的至少一个超级集群121a的多个集群123a和123b的多个HPU 125a-d的一次第二瞬时加载、或以上的组合来从所选择的至少一个超级集群121a的多个集群123a和123b的多个HPU 125a-d中选择给定HPU 125b。Thus, as an alternative to the at least one CPU core 103 selecting the at least one supercluster, the HNA scheduler 129 may select the at least one supercluster, such as supercluster 121a of FIG. 1C, based on the provided option. For example, if the provided option indicates that the HNA scheduler 129 is to select the at least one supercluster, the HNA scheduler 129 may select the at least one supercluster based on the provided option and (i) a first round-robin scheduling of the plurality of superclusters, (ii) a first instantaneous loading of the plurality of superclusters, or (iii) a combination of (i) and (ii). The HNA scheduler 129 may then select a given HPU 125b from among the plurality of HPUs 125a-d of the plurality of clusters 123a and 123b of the selected at least one supercluster 121a based on a second round-robin scheduling of the plurality of HPUs 125a-d of the plurality of clusters 123a and 123b of the at least one supercluster 121a selected by the HNA scheduler 129, a second instantaneous loading of those HPUs 125a-d, or a combination thereof.
回到图1D，至少一个HNA处理器108可以进一步包括该多个超级集群的该多个集群的该多个HPU(如图1C的该多个超级集群121a和121b的该多个集群123a-d的该多个HPU 125a-h)可以访问的HNA片上图形存储器156b。HNA片上图形存储器156b可以被配置成用于静态地存储至少一个每模式NFA(未示出)的一个节点子集(未示出)。该节点子集可以由一个编译器(如至少一个每模式NFA 314的图3A的编译器306)确定，该编译器可以确定节点分布，如以下参照图7A和图8所披露的节点分布。HNA片上图形存储器156b可以被配置成用于存储多种类型的NFA节点。不同节点类型的NFA节点可以配置有给定大小，从而使多种节点类型的多个节点能够具有相同的节点大小。Returning to FIG. 1D, the at least one HNA processor 108 may further include an HNA on-chip graph memory 156b accessible to the plurality of HPUs of the plurality of clusters of the plurality of superclusters (e.g., the plurality of HPUs 125a-h of the plurality of clusters 123a-d of the plurality of superclusters 121a and 121b of FIG. 1C). The HNA on-chip graph memory 156b may be configured to statically store a subset of nodes (not shown) of at least one per-pattern NFA (not shown). The subset of nodes may be determined by a compiler (e.g., compiler 306 of FIG. 3A for the at least one per-pattern NFA 314), which may determine a node distribution, such as the node distribution disclosed below with reference to FIG. 7A and FIG. 8. The HNA on-chip graph memory 156b may be configured to store multiple types of NFA nodes. NFA nodes of different node types may be configured with a given size, thereby enabling multiple nodes of multiple node types to have the same node size.
回到图1C,至少一个HNA指令153可以是第一至少一个HNA指令,并且安全装置102可以进一步包括至少一个系统存储器,如图1A中的可以操作性地耦合至至少一个CPU内核103和至少一个HNA处理器108的至少一个系统存储器151。1C , the at least one HNA instruction 153 may be a first at least one HNA instruction, and the security device 102 may further include at least one system memory, such as the at least one system memory 151 in FIG. 1A , which may be operatively coupled to the at least one CPU core 103 and the at least one HNA processor 108 .
图1E为至少一个系统存储器151的示例实施例的框图160。根据在此披露的实施例，至少一个系统存储器(如以上参照图1A所披露的至少一个系统存储器151)可以被配置成包括用于存储第二至少一个HNA指令(未示出)的HNA芯片外指令队列163。该第二至少一个HNA指令可以未决传输至HNA处理器108的HNA片上指令队列154。至少一个系统存储器151可以进一步包括一个被配置成用于静态地存储至少一个每模式NFA(未示出)的一个节点子集(未示出)的HNA芯片外图形存储器156c。该节点子集可以由至少一个每模式NFA的一个编译器(如至少一个每模式NFA 314的图3A的编译器306)确定，该编译器可以确定节点分布，如以下参照图7A和图8所披露的节点分布。HNA芯片外图形存储器156c可以被配置成用于存储多种类型的NFA节点。不同节点类型的NFA节点可以配置有给定大小，从而使多种节点类型的多个节点能够具有相同的节点大小。FIG. 1E is a block diagram 160 of an example embodiment of the at least one system memory 151. According to embodiments disclosed herein, at least one system memory (such as the at least one system memory 151 disclosed above with reference to FIG. 1A) may be configured to include an HNA off-chip instruction queue 163 for storing a second at least one HNA instruction (not shown). The second at least one HNA instruction may be pending transfer to the HNA on-chip instruction queue 154 of the HNA processor 108. The at least one system memory 151 may further include an HNA off-chip graph memory 156c configured to statically store a subset of nodes (not shown) of at least one per-pattern NFA (not shown). The subset of nodes may be determined by a compiler of the at least one per-pattern NFA (e.g., compiler 306 of FIG. 3A for the at least one per-pattern NFA 314), which may determine a node distribution, such as the node distribution disclosed below with reference to FIG. 7A and FIG. 8. The HNA off-chip graph memory 156c may be configured to store multiple types of NFA nodes. NFA nodes of different node types may be configured with a given size, thereby enabling multiple nodes of multiple node types to have the same node size.
根据在此披露的实施例，图1C的安全装置102可以进一步包括L2C 113、至少一个LMC 117、以及图1A的至少一个系统存储器151。至少一个LMC 117可以操作性地耦合至至少一个HNA处理器108和至少一个系统存储器151。该至少一个LMC中的给定LMC可以被配置成用于能够实现至少一个系统存储器151的访问以便至少一个HNA处理器108对HNA芯片外图形存储器156c进行访问。通过非一致路径119a来绕过L2C 113可以通过避免由于保持通过一致路径115b对HNA芯片外图形存储器156c的访问的一致性而另外引起的延迟来提高至少一个HNA处理器108的匹配性能。由于HNA芯片外图形存储器156c中所存储的节点可能没有时间或空间局部性并且由于从至少一个HNA处理器108的角度看，此类存储的节点的访问是只读的，通过非一致路径119a访问HNA芯片外图形存储器156c可以实现又另一个优点，因为此类访问将不污染L2C 113的现有内容。According to embodiments disclosed herein, the security device 102 of FIG. 1C may further include the L2C 113, the at least one LMC 117, and the at least one system memory 151 of FIG. 1A. The at least one LMC 117 may be operatively coupled to the at least one HNA processor 108 and the at least one system memory 151. A given LMC of the at least one LMC may be configured to enable access to the at least one system memory 151 so that the at least one HNA processor 108 can access the HNA off-chip graph memory 156c. Bypassing the L2C 113 via the non-coherent path 119a may improve the matching performance of the at least one HNA processor 108 by avoiding delays that would otherwise be incurred by maintaining coherence of accesses to the HNA off-chip graph memory 156c via the coherent path 115b. Because nodes stored in the HNA off-chip graph memory 156c may not have temporal or spatial locality, and because access to such stored nodes is read-only from the perspective of the at least one HNA processor 108, accessing the HNA off-chip graph memory 156c via the non-coherent path 119a may achieve yet another advantage in that such access will not pollute the existing contents of the L2C 113.
回到图1E,至少一个系统存储器151可以被进一步配置成包括可以被配置成用于连续地存储多个有效载荷的HNA数据包数据存储器165。该多个有效载荷中的每个有效载荷可以具有一个固定最大长度,如1536字节或任何其他合适的固定最大长度。该多个有效载荷中的每个有效载荷可以与HNA片上指令队列154中所存储的第一至少一个HNA指令或可以存储在HNA芯片外指令队列163内并且未决传输至HNA片上指令队列154的第二至少一个HNA指令中的一个给定HNA指令相关联。Returning to FIG. 1E , the at least one system memory 151 can be further configured to include an HNA packet data memory 165 that can be configured to contiguously store a plurality of payloads. Each of the plurality of payloads can have a fixed maximum length, such as 1536 bytes or any other suitable fixed maximum length. Each of the plurality of payloads can be associated with a given HNA instruction from among the first at least one HNA instruction stored in the HNA on-chip instruction queue 154 or the second at least one HNA instruction that can be stored in the HNA off-chip instruction queue 163 and pending transmission to the HNA on-chip instruction queue 154.
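Because payloads are stored contiguously with a fixed maximum length, a payload's location can be computed from its slot index alone. The sketch below assumes the 1536-byte example above; the function and constant names are illustrative.

```python
SLOT_BYTES = 1536  # fixed maximum payload length from the example above


def payload_offset(index):
    # Payloads are stored contiguously in the HNA packet data memory,
    # so slot index alone determines a payload's byte offset.
    return index * SLOT_BYTES


assert payload_offset(0) == 0
assert payload_offset(3) == 4608  # 3 * 1536
```

This is the usual trade-off of fixed-size slots: O(1) addressing and no per-payload length bookkeeping, at the cost of internal fragmentation for payloads shorter than the maximum.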
根据在此披露的实施例，至少一个系统存储器151可以被进一步配置成包括一个可以被配置成用于存储至少一个HNA输入堆栈的HNA输入堆栈分区161。每个至少一个HNA输入堆栈可以被配置成用于存储针对该多个超级集群的该多个集群的该多个HPU的至少一个HPU的至少一项HNA输入工作，如以上所披露的HNA处理器108的该多个超级集群121a和121b的该多个集群123a-d的该多个HPU 125a-h。According to embodiments disclosed herein, the at least one system memory 151 may be further configured to include an HNA input stack partition 161 configured to store at least one HNA input stack. Each at least one HNA input stack may be configured to store at least one HNA input job for at least one HPU of the plurality of HPUs of the plurality of clusters of the plurality of superclusters, such as the plurality of HPUs 125a-h of the plurality of clusters 123a-d of the plurality of superclusters 121a and 121b of the HNA processor 108 disclosed above.
The at least one system memory 151 may further include an HNA off-chip run stack partition 167 that may be configured to store at least one off-chip run stack to extend storage of at least one on-chip run stack, such as the run stack 460 disclosed below with reference to FIG. 4A. Each at least one on-chip run stack may be configured to store at least one run-time HNA job for a corresponding HPU, such as the HPU 425 disclosed below with reference to FIG. 4A.
The at least one system memory 151 may further include an HNA off-chip save buffer partition 169 that may be configured to extend storage of at least one on-chip save buffer, such as the save buffer 464 disclosed below with reference to FIG. 4A. The on-chip save buffer may be configured to store at least one run-time HNA job for a corresponding HPU, such as the HPU 425, based on detection of a payload boundary, as disclosed below with reference to FIG. 4A.
The at least one system memory 151 may further include an HNA off-chip result buffer partition 171 that may be configured to store at least one final match result entry of a match result buffer, such as the match result buffer 466 disclosed below with reference to FIG. 4A. The at least one final match result may be a final match, determined by at least one HPU, of at least one regular expression pattern to be matched in the input stream. Each at least one HNA instruction that may be stored in the HNA on-chip instruction queue 154 or the HNA off-chip instruction queue 163 may identify a given HNA input stack of the HNA input stack partition 161, a given HNA off-chip run stack of the HNA off-chip run stack partition 167, a given HNA off-chip save buffer of the HNA off-chip save buffer partition 169, and a given HNA off-chip result buffer of the HNA off-chip result buffer partition 171.
Returning to FIG. 1A, a given LMC of the at least one LMC 117 may be configured to enable the at least one HNA processor 108 to access the HNA packet data memory 165, the HNA input stack partition 161, the HNA off-chip instruction queue 163, the HNA off-chip run stack partition 167, the HNA off-chip save buffer partition 169, and the HNA off-chip result buffer partition 171 via the coherent path 115b, and may be configured to enable the at least one HNA processor 108 to access the HNA off-chip graph memory 156c via the non-coherent path 119a.
Returning to FIG. 1E, the HNA input stack partition 161 may include HNA jobs that may be new HNA jobs generated by DFA processing. As disclosed above, the at least one HNA processor 108 may be complementary to the HFA processor 110, which provides hardware acceleration for deterministic finite automaton (DFA) processing, as disclosed below with reference to FIG. 1G.
FIG. 1F is a flow diagram (180) of an example embodiment of a method. The method may begin (182) and include a plurality of superclusters in at least one HNA processor that is operatively coupled to at least one CPU core and dedicated to non-deterministic finite automaton (NFA) processing (184). The method may include a plurality of clusters in each supercluster (186). The method may include a plurality of HNA processing units (HPUs) in each cluster of the plurality of clusters (188). The method may select at least one supercluster of the plurality of superclusters (190). The method may select a given HPU of the plurality of HPUs of the plurality of clusters of the selected at least one supercluster (192). The method may assign at least one HNA instruction to the selected given HPU to initiate matching of at least one regular expression pattern in an input stream received from a network (194), and, in the example embodiment, the method thereafter ends.
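The selection steps above (supercluster, then cluster, then HPU, then instruction assignment) can be sketched in software. The topology sizes, the first-idle `dispatch` policy, and the instruction strings below are illustrative assumptions for this sketch, not the patent's actual scheduling logic:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HPU:
    busy: bool = False
    instructions: List[str] = field(default_factory=list)

# Hypothetical topology: 2 superclusters x 4 clusters x 8 HPUs, mirroring
# superclusters 121a-b, clusters 123a-d, and HPUs 125a-h of the text.
topology = [[[HPU() for _ in range(8)] for _ in range(4)] for _ in range(2)]

def dispatch(instruction: str) -> Optional[HPU]:
    """Select a supercluster, then an idle HPU in one of its clusters,
    and assign the HNA instruction to it (first-idle policy, for illustration)."""
    for supercluster in topology:
        for cluster in supercluster:
            for hpu in cluster:
                if not hpu.busy:
                    hpu.busy = True
                    hpu.instructions.append(instruction)
                    return hpu
    return None  # all HPUs busy; the instruction would wait in the off-chip queue

first = dispatch("match /ab.*cd/ against payload 0")
```

A real engine would balance load across superclusters rather than scan in a fixed order; the nested loops above only illustrate the hierarchy of the selection.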
FIG. 1G is a block diagram of another example embodiment of the security device 102 disclosed above, in which embodiments disclosed herein may be implemented. The security device 102 may include a network services processor 100. The security device 102 may be a standalone system that may switch packets received at one network interface 107a to another network interface 107b and may perform a plurality of security functions on received packets before forwarding them. For example, the security device 102 may be used to perform security processing on packets 101a that may be received on a wide area network (WAN) 105a, or any other suitable network, before forwarding the processed packets 101b to a local area network (LAN) 105b, or any other suitable network.
The network services processor 100 may be configured to process Open Systems Interconnection (OSI) network L2-L7 layer protocols encapsulated in received packets. As is well known to those skilled in the art, the OSI reference model defines seven network protocol layers (L1-L7). The physical layer (L1) represents the actual interface, electrical and physical, that connects a device to a transmission medium. The data link layer (L2) performs data framing. The network layer (L3) formats the data into packets. The transport layer (L4) handles end-to-end transport. The session layer (L5) manages communications between devices, for example, whether communication is half-duplex or full-duplex. The presentation layer (L6) manages data formatting and presentation, for example, syntax, control codes, special graphics, and character sets. The application layer (L7) permits communications between users, for example, file transfer and electronic mail.
The network services processor 100 may schedule and queue work (e.g., packet processing operations) for upper-level network protocols (e.g., L4-L7) and may enable processing of upper-level network protocols in received packets to be performed so as to forward packets at wire speed. By processing the protocols to forward the packets at wire speed, the network services processor 100 does not slow down the network data transfer rate. The network services processor 100 may receive packets from the network interface 107a or 107b, which may be a physical hardware interface, and may perform L2-L7 network protocol processing on the received packets. The network services processor 100 may subsequently forward processed packets 101b through the network interface 107a or 107b to another hop in the network, to a final destination, or through another bus (not shown) for further processing by a host processor (not shown). The network protocol processing may include processing of network security protocols such as firewall, application firewall, virtual private network (VPN) including IP security (IPSec) and/or secure sockets layer (SSL), intrusion detection system (IDS), and anti-virus (AV), or any other suitable network protocol.
The network services processor 100 may deliver high application performance using a plurality of processors (i.e., cores), such as the at least one CPU core 103 disclosed above. Each of the cores (not shown) may be dedicated to performing data plane operations, control plane operations, or a combination thereof. A data plane operation may include packet operations for forwarding packets. A control plane operation may include processing of portions of complex higher-level protocols, such as Internet Protocol Security (IPSec), Transmission Control Protocol (TCP), Secure Sockets Layer (SSL), or any other suitable higher-level protocol. A data plane operation may include processing of other portions of these complex higher-level protocols.
The network services processor 100 may also include application-specific co-processors that may offload the cores so that the network services processor 100 achieves high throughput. For example, the network services processor 100 may include an acceleration unit 106 that may include the HNA processor 108 for hardware acceleration of NFA processing and the HFA processor 110 for hardware acceleration of DFA processing. The HNA processor 108 and the HFA processor 110 may be co-processors configured to offload the general-purpose cores of the network services processor 100, such as the at least one CPU core 103 disclosed above, from the heavy burden of performing compute- and memory-intensive pattern matching methods.
The network services processor 100 may perform pattern search, regular expression processing, content validation, transformation, and security to accelerate packet processing. Regular expression processing and pattern search may be used to perform string matching for AV and IDS applications, and other applications that may require string matching. A memory controller (not shown) in the network services processor 100 may control access to a memory 104 that is operatively coupled to the network services processor 100. The memory 104 may be internal (i.e., on-chip) or external (i.e., off-chip), or a combination thereof, and may be configured to store received packets, such as the packets 101a, for processing by the network services processor 100. The memory 104 may be configured to store compiled rule data used for lookup and pattern matching in DFA and NFA graph expression searches. The compiled rule data may be stored as a binary image 112 that may include compiled rule data for both DFA and NFA, or multiple binary images separating DFA compiled rule data from NFA compiled rule data.
As disclosed above, typical content-aware application processing may use either a DFA or an NFA to recognize patterns in the content of received packets. DFA and NFA are both finite state machines, that is, models of computation each including a set of states, a start state, an input alphabet (set of all possible symbols), and a transition function. Computation begins in the start state and changes to new states dependent on the transition function.
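A minimal software model of such a state machine can be sketched as follows, assuming a toy DFA that recognizes payloads containing the substring "ab"; the states, accept set, and transition function here are illustrative and not taken from the figures:

```python
# A finite state machine as the tuple described above: a set of states,
# a start state, an input alphabet, a transition function, and (for
# matching) a set of accepting states. This toy DFA detects "ab".
states = {0, 1, 2}
start = 0
accept = {2}

def transition(state, symbol):
    if state == 0:
        return 1 if symbol == "a" else 0
    if state == 1:
        return 2 if symbol == "b" else (1 if symbol == "a" else 0)
    return 2  # state 2 is absorbing once "ab" has been seen

def run(payload):
    state = start
    for symbol in payload:  # each consumed segment drives exactly one transition
        state = transition(state, symbol)
    return state in accept

run("xxabyy")  # → True: "ab" occurs in the payload
```

Note that exactly one next state exists per (state, symbol) pair, which is what makes the machine deterministic.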
Patterns are commonly expressed using regular expressions that include basic elements, for example, normal text characters such as A-Z and 0-9, and metacharacters such as *, ^, and |. The basic elements of a regular expression are the symbols (single characters) to be matched. Basic elements may be combined with metacharacters that allow concatenation, alternation (|), and Kleene star (*). The metacharacter for concatenation may be used to create multiple-character matching patterns from single characters (or substrings), while the metacharacter for alternation (|) may be used to create a regular expression that can match any of two or more substrings. The metacharacter Kleene star (*) allows a pattern to match any number of occurrences, including no occurrences, of the preceding character or string of characters.
Combining different operators and single characters allows complex subpatterns of expressions to be constructed. For example, a subpattern such as (th(is|at)*) may match multiple character strings, such as: th, this, that, thisis, thisat, thatis, or thatat. Another example of a complex pattern of an expression may be one that incorporates a character class construct [...] that allows listing of a list of characters for which to search. For example, gr[ea]y looks for both grey and gray. Further examples of complex subpatterns are those that may use a dash to indicate a range of characters, for example, [A-Z], or a metacharacter "." that matches any one character. An element of the pattern may be a basic element or a combination of one or more basic elements combined with one or more metacharacters.
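These constructs can be exercised with any regular expression engine; the sketch below uses Python's `re` module (whose semantics for these basic constructs agree with PCRE) to confirm the example matches from the text:

```python
import re

# Concatenation, alternation (|), and Kleene star (*), as in (th(is|at)*):
sub = re.compile(r"th(is|at)*")
matches = [s for s in ["th", "this", "that", "thisat", "thatis", "thx"]
           if sub.fullmatch(s)]  # "thx" is rejected; the rest match

# A character class, as in gr[ea]y, matches either listed character:
grey = re.compile(r"gr[ea]y")
both = [s for s in ["grey", "gray", "groy"] if grey.fullmatch(s)]
```

Here `matches` contains the five strings that the Kleene-starred group accepts, and `both` contains "grey" and "gray" only.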
The input to the DFA or NFA state machine typically includes segments, such as a string of (8-bit) bytes, from the input stream (i.e., received packets); that is, the alphabet may be a single byte (one character or symbol). Each segment (e.g., byte) in the input stream may result in a transition from one state to another state. The states and the transition functions of the DFA or NFA state machine may be represented by a graph of nodes. Each node in the graph may represent a state, and arcs in the graph (also referred to herein as transitions or transition arcs) may represent state transitions. The current state of the state machine may be represented by a node identifier that selects a particular node in the graph.
Processing a regular expression with a DFA, to find in an input stream of characters a pattern or patterns described by the regular expression, may be characterized as having deterministic run-time performance. Because there is only one state transition per input character for each DFA state, a next state of the DFA can be determined from the input character (or symbol) and the current state of the DFA. As such, the run-time performance of the DFA is considered deterministic, and its behavior can be completely predicted from the input. The tradeoff for this determinism, however, is a graph in which the number of nodes (or graph size) may grow exponentially with the size of the pattern.
In contrast, the number of nodes of an NFA graph may be characterized as growing linearly with the size of the pattern. However, processing a regular expression with an NFA, to find in an input stream of characters a pattern or patterns described by the regular expression, may be characterized as having non-deterministic run-time performance. For example, given an input character (or symbol) and a current state of the NFA, there may be more than one next state of the NFA to which to transition. As such, a next state of the NFA cannot be uniquely determined from the input and the current state of the NFA. Thus, the run-time performance of the NFA is considered non-deterministic, as its behavior cannot be completely predicted from the input.
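Because one symbol may lead to more than one next state, a software walker commonly tracks the set of currently active NFA states. The sketch below assumes a hand-built three-state NFA for the pattern ".*a[^\n]" discussed with FIG. 2A; the state numbering and the match-reporting policy are illustrative:

```python
def nfa_step(active, symbol):
    """One input symbol can yield more than one next state per active state."""
    nxt = set()
    for state in active:
        if state == 0:
            nxt.add(0)          # .* keeps the start state active...
            if symbol == "a":
                nxt.add(1)      # ...and 'a' also advances the match
        elif state == 1 and symbol != "\n":
            nxt.add(2)          # [^\n] consumes one non-newline character
    return nxt

def nfa_match(payload):
    active = {0}
    for symbol in payload:
        active = nfa_step(active, symbol)
        if 2 in active:
            return True         # a real walker would record the match offset here
    return False
```

On the payload "xaz", the symbol "a" drives state 0 to both states 0 and 1 simultaneously, which is precisely the non-determinism described above.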
FIGs. 2A-G illustrate the concept of DFA "graph explosion." FIGs. 2A, 2B, and 2C show NFA graphs for the patterns ".*a[^\n]", ".*a[^\n][^\n]", and ".*a[^\n][^\n][^\n]", respectively, and FIGs. 2D, 2E, and 2F show DFA graphs for the same patterns, respectively. As shown in FIGs. 2A-F, and as summarized by the table of FIG. 2G, for certain patterns the NFA may grow linearly while the DFA for the same patterns may grow exponentially, resulting in a graph explosion. As shown, for a given pattern or patterns, the number of DFA states may be larger than the number of NFA states, typically on the order of hundreds or thousands more states. This is an example of "graph explosion," which is a hallmark characteristic of DFA.
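The linear-versus-exponential growth can be reproduced with the textbook subset construction. The sketch below uses an analogous standard family of patterns, "the k-th symbol from the end is 'a'" over the alphabet {a, b}, whose NFA has k + 1 states while its determinized DFA has 2^k reachable states; the construction and counts are standard results, not figures taken from the patent:

```python
def nfa_delta(state, symbol, k):
    """NFA for 'the k-th symbol from the end is a' over {a, b}: k + 1 states."""
    nxt = set()
    if state == 0:
        nxt.add(0)               # .*-style self-loop on the start state
        if symbol == "a":
            nxt.add(1)
    elif state < k:
        nxt.add(state + 1)       # any symbol advances toward the accept state k
    return nxt

def determinize(k):
    """Subset construction: collect every reachable set of NFA states."""
    start = frozenset({0})
    seen, frontier = {start}, [start]
    while frontier:
        subset = frontier.pop()
        for symbol in "ab":
            nxt = frozenset(s for q in subset for s in nfa_delta(q, symbol, k))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# The NFA grows linearly (k + 1 states) but its DFA grows exponentially:
dfa_sizes = [len(determinize(k)) for k in (2, 3, 4)]  # → [4, 8, 16]
```

Each reachable DFA state is a distinct subset recording which of the last k symbols were 'a', hence the 2^k count.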
According to embodiments disclosed herein, content searching may be performed using the DFA, the NFA, or a combination thereof. According to one embodiment, a run-time processor, co-processor, or a combination thereof may be implemented in hardware and may be configured to implement a compiler and a walker.
The compiler may compile a pattern or an input list of patterns (also referred to herein as signatures or rules) into the DFA, the NFA, or a combination thereof. The DFA and the NFA may be binary data structures, such as DFA and NFA graphs and tables.
The walker may perform run-time processing, that is, actions for identifying the existence of a pattern in an input stream, or matching the pattern to content in the input stream. The content may be a payload portion of an Internet Protocol (IP) datagram, or any other suitable payload in the input stream. Run-time processing of the DFA or NFA graphs may be referred to as walking the DFA or NFA graphs with the payload to determine a pattern match. A processor configured to generate the DFA, the NFA, or a combination thereof may be referred to herein as a compiler. A processor configured to implement run-time processing of a payload using the generated DFA, NFA, or combination thereof may be referred to herein as a walker. According to embodiments disclosed herein, the network services processor 100 may be configured to implement the compiler and the walker in the security device 102.
FIG. 3A is a block diagram of another embodiment of the security device 102 in which embodiments disclosed herein may be implemented. As disclosed with reference to FIG. 1G, the security device 102 may be operatively coupled to one or more networks and may include the memory 104 and the network services processor 100 that may include the acceleration unit 106. With reference to FIG. 3A, the network services processor 100 may be configured to implement a compiler 306 that generates the binary image 112 and a walker 320 that uses the binary image 112. For example, the compiler 306 may generate the binary image 112 that includes compiled rule data used by the walker 320 to perform a pattern matching method on the received packets 101a (shown in FIG. 1G). According to embodiments disclosed herein, the compiler 306 may generate the binary image 112 by determining compiled rule data for the DFA and the NFA based on at least one heuristic, as further described below. The compiler 306 may determine rule data advantageously suited for the DFA and rule data advantageously suited for the NFA.
According to embodiments disclosed herein, the compiler 306 may generate the binary image 112 by processing a rule set 310 that may include a set 304 of one or more regular expression patterns and optional qualifiers 308. From the rule set 310, the compiler 306 may generate a unified DFA 312, using subpatterns selected from all of the one or more regular expression patterns, and at least one NFA 314 for at least one pattern in the set 304 of one or more regular expression patterns, for use by the walker 320 during run-time processing, as well as metadata (not shown) including mapping information for transitioning the walker 320 between states (not shown) of the unified DFA 312 and states of the at least one NFA 314.
The unified DFA 312 and the at least one NFA 314 may be represented, data-structure-wise, as graphs, or in any other suitable form, and the mapping in the metadata may be represented, data-structure-wise, as one or more tables, or in any other suitable form. According to embodiments disclosed herein, if a subpattern selected from a pattern is the entire pattern, then no NFA is generated for the pattern. According to embodiments disclosed herein, each NFA generated may be for a particular pattern in the set, whereas the unified DFA may be generated based on all of the subpatterns from all of the patterns in the set.
By consuming (i.e., processing) segments, such as bytes from the payload of the received packet 101a, the walker 320 walks the unified DFA 312 and the at least one NFA 314 with the payload by transitioning states of the unified DFA 312 and the at least one NFA 314. As such, the walker 320 walks the payload through the unified DFA 312 and the at least one NFA 314, each of which may be a per-pattern NFA generated for a single regular expression pattern.
The rule set 310 may include the set 304 of one or more regular expression patterns and may be in the form of a Perl Compatible Regular Expression (PCRE) or any other suitable form. PCRE has become a de facto standard for regular expression syntax in security and networking applications. As more applications requiring deep packet inspection have emerged, and more threats have become prevalent on the Internet, the corresponding signatures/patterns for identifying viruses/attacks have also become more complex. For example, signature databases have evolved from having simple string patterns to regular expression (regex) patterns with wildcard characters, ranges, character classes, and advanced PCRE features.
As shown in FIG. 3A, the optional qualifiers 308 may each be associated with a pattern in the set 304 of regular expression patterns. For example, optional qualifiers 322 may be associated with a pattern 316. The optional qualifiers 308 may each be one or more qualifiers designating desired custom, advanced PCRE feature options, or other suitable options for processing the pattern associated with the qualifiers. For example, the qualifiers 322 may indicate whether a start offset (i.e., the location in the payload of the first matching character of the pattern being matched in the payload) option of the advanced PCRE feature options is desired for the pattern 316.
According to embodiments disclosed herein, the compiler 306 may generate the unified DFA 312 using subpatterns 302 selected from all patterns in the set 304 of one or more regular expression patterns. The compiler 306 may select the subpatterns 302 from each pattern in the set 304 of one or more regular expression patterns based on at least one heuristic, as further described below. The compiler 306 may also generate the at least one NFA 314 for at least one pattern 316 in the set, and may determine a portion (not shown) of the at least one pattern 316 used to generate the at least one NFA 314, and at least one walk direction for run-time processing (i.e., walking) of the at least one NFA 314, based on whether a length of a selected subpattern 318 is fixed or variable and on a location of the selected subpattern 318 within the at least one pattern 316. The compiler 306 may store the unified DFA 312 and the at least one NFA 314 in the at least one memory 104.
The compiler may determine whether the length of a selected potential subpattern is fixed or variable. For example, a subpattern such as "cdef" may be determined to have a fixed length of 4, as "cdef" is a string, whereas a complex subpattern that includes operators may be determined to have a variable length. For example, for a complex pattern such as "a.*cd[^\n]{0,10}.*y", the selected subpattern may be "cd[^\n]{0,10}", which may have a variable length of 2 to 12.
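One way such a fixed-versus-variable determination could be made is by summing per-element length ranges. The simplified subpattern grammar handled below (plain literals, escaped literals, [...] character classes, and {m,n} counts) is an assumption for illustration and ignores features such as escaped "]" inside a class:

```python
def length_range(subpattern):
    """Return (min, max) match length for a simplified subpattern grammar:
    literal characters, [...] character classes, and {m,n} counts only."""
    i, lo, hi = 0, 0, 0
    while i < len(subpattern):
        if subpattern[i] == "[":
            i = subpattern.index("]", i) + 1   # a class matches one character
        elif subpattern[i] == "\\":
            i += 2                             # an escaped literal character
        else:
            i += 1                             # a plain literal character
        unit_lo = unit_hi = 1
        if i < len(subpattern) and subpattern[i] == "{":
            j = subpattern.index("}", i)
            m, _, n = subpattern[i + 1:j].partition(",")
            unit_lo, unit_hi = int(m), int(n or m)
            i = j + 1
        lo, hi = lo + unit_lo, hi + unit_hi
    return lo, hi

length_range(r"cd[^\n]{0,10}")  # → (2, 12)
```

A subpattern is then fixed-length exactly when the two bounds are equal, as for "cdef", which yields (4, 4).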
According to embodiments disclosed herein, subpattern selection may be based on at least one heuristic. A subpattern is a set of one or more consecutive elements from a pattern, where each element from the pattern may be represented by a node in a DFA or NFA graph for purposes of matching bytes or characters from the payload. An element, as described above, may be a single text character represented by a node or a character class represented by a node. The compiler 306 may determine which subpatterns in a pattern are better suited for the NFA based on whether or not the subpattern is likely to cause excessive DFA graph explosion, as described above with reference to FIGs. 2A-G. For example, generating a DFA from a subpattern including consecutive text characters would not result in DFA graph explosion, whereas a complex subpattern, as described above, may include multiple operators and multiple characters and may therefore cause DFA graph explosion. For example, a subpattern including a wildcard character or a larger character class repeated multiple times (e.g., [^\n]* or [^\n]{16}) may produce an excessive number of states in the DFA and may therefore be better suited for the NFA. As such, the compiler 306 may be referred to herein as a "smart compiler."
As disclosed above, selecting a subpattern from each pattern in the set 304 of one or more regular expressions may be based on at least one heuristic. According to one embodiment, the at least one heuristic may include maximizing the number of unique subpatterns selected and the length of each subpattern selected. For example, a pattern such as "ab.*cdef.*mn" may have multiple potential subpatterns, such as "ab.*", "cdef", and ".*mn". The compiler may select "cdef" as the subpattern for the pattern because it is the longest subpattern in the pattern "ab.*cdef.*mn" that is unlikely to cause DFA graph explosion. However, if the subpattern "cdef" has already been selected for another pattern, the compiler may select an alternative subpattern for the pattern "ab.*cdef.*mn". Alternatively, the compiler may replace the subpattern "cdef" selected for the other pattern with another subpattern of the other pattern, enabling selection of the subpattern "cdef" for the pattern "ab.*cdef.*mn".
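Extracting candidate subpatterns such as "ab", "cdef", and "mn" from a pattern can be approximated by splitting on runs of metacharacters; the split below is a deliberate simplification of the compiler's actual selection heuristics:

```python
import re

def literal_runs(pattern):
    """Split a pattern on metacharacter constructs, leaving the literal runs
    that are safe DFA subpatterns (a simplification of the real heuristics)."""
    return [run for run in re.split(r"[.*+?|()\[\]{}^$\\]+", pattern) if run]

runs = literal_runs("ab.*cdef.*mn")      # → ['ab', 'cdef', 'mn']
longest = max(runs, key=len)             # → 'cdef'
```

Choosing the longest run implements the length-maximizing part of the heuristic; uniqueness across patterns is tracked separately, as described below.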
As such, the compiler 306 may select subpatterns for the patterns 304 based on a context of the possible subpatterns of each of the patterns 304, enabling maximization of the number of unique subpatterns selected and the length of each subpattern selected. As such, the compiler 306 may generate, from the selected subpatterns 302, the unified DFA 312 that minimizes a number of false positives (i.e., no match or only a partial match) in the pattern matching of the at least one NFA 314 by increasing the probability of a pattern match in the at least one NFA 314.
By maximizing subpattern length, false positives in NFA processing may be avoided. False positives in NFA processing result in non-deterministic run-time processing and, thus, reduced run-time performance. Further, by maximizing the number of unique subpatterns selected, the compiler 306 enables a 1:1 transition between the unified DFA and the at least one NFA 314 generated from the patterns in the set, given a match of a subpattern (from a pattern) in the unified DFA.
例如,如果多个模式共享所选择的子模式,则统一DFA的行走器将需要转换至多个NFA,因为每个至少一个NFA是每模式NFA,并且来自统一DFA的子模式匹配意味着针对该多个模式中每个模式的部分匹配。如此,使唯一子模式的数量最大化减少了DFA:NFA 1:N转换的数量,从而减少了行走器320进行的运行时间处理。For example, if multiple patterns share a selected subpattern, the walker of the unified DFA would need to transition to multiple NFAs of the at least one NFA 314, because each of the at least one NFA is a per-pattern NFA, and a subpattern match in the unified DFA signifies a partial match for each of those multiple patterns. As such, maximizing the number of unique subpatterns reduces the number of 1:N DFA-to-NFA transitions, thereby reducing the run-time processing performed by the walker 320.
为了能够使唯一子模式的数量最大化,编译器306可以计算所选择的子模式318的散列值326并将所计算的散列值326与从其中选择子模式318的模式316的标识符(未示出)相关联地一起存储。例如,针对集合304中的每个模式,编译器306可以计算所选择的子模式的散列值。所计算的散列值326可以按照表的方式、或以任何合适的方式存储在至少一个存储器104中。所使用的散列法可以是任何合适的散列法。编译器可以将所计算的散列值与为该集合中的其他模式选择的子模式的散列值列表进行比较,以便确定所选择的子模式是否是唯一的。In order to maximize the number of unique subpatterns, the compiler 306 may calculate a hash value 326 of the selected subpattern 318 and store the calculated hash value 326 in association with an identifier (not shown) of the pattern 316 from which the subpattern 318 was selected. For example, for each pattern in the set 304, the compiler 306 may calculate a hash value of the selected subpattern. The calculated hash values 326 may be stored in the at least one memory 104 in a table, or in any other suitable manner. The hash method used may be any suitable hash method. The compiler may compare the calculated hash value against a list of hash values of subpatterns selected for the other patterns in the set in order to determine whether the selected subpattern is unique.
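The hash-based bookkeeping described above might be sketched as follows. This is illustrative only: the text says any suitable hash method may be used, so MD5 merely stands in for one, and the class and method names are assumptions:

```python
import hashlib

class SubpatternTable:
    """Track selected subpatterns by hash value so that uniqueness
    checks need only compare hashes, with each hash stored in
    association with the identifiers of the patterns that selected
    the subpattern."""
    def __init__(self):
        self._by_hash = {}  # hash value -> list of pattern identifiers

    def _hash(self, subpattern):
        # Stand-in for "any suitable hash method".
        return hashlib.md5(subpattern.encode()).hexdigest()

    def is_unique(self, subpattern):
        return self._hash(subpattern) not in self._by_hash

    def add(self, subpattern, pattern_id):
        self._by_hash.setdefault(self._hash(subpattern), []).append(pattern_id)

    def owners(self, subpattern):
        """Patterns that already selected this subpattern."""
        return self._by_hash.get(self._hash(subpattern), [])
```

A hit in `owners` identifies the other pattern whose subpattern collides, which is the input to the replacement decision described next.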
如果在该列表中发现所计算的散列值,则编译器可以确定是否用来自该模式的另一个子模式替换(i)所选择的子模式,或用从该另一个模式中选择的替代子模式替换(ii)为该集合中的另一个模式选择的子模式。可以基于与该列表中所计算的散列值的关联性来标识该集合中的其他模式。确定替换(i)还是(ii)可以基于对考虑替换的子模式的长度进行比较,以便如上所述最大化所选择的唯一子模式的长度。替换所选择的子模式可以包括选择为给定模式标识的下一个最长子模式、或优先次序次高的子模式。例如,可以基于引起DFA爆炸的可能性或所预期的DFA爆炸量级来确定潜在子模式的优先次序。If the calculated hash value is found in the list, the compiler may determine whether to replace (i) the selected subpattern with another subpattern from the pattern, or (ii) the subpattern selected for another pattern in the set with an alternative subpattern selected from that other pattern. The other pattern in the set may be identified based on the association with the calculated hash value in the list. Determining whether to replace (i) or (ii) may be based on comparing the lengths of the subpatterns being considered for replacement, so as to maximize the lengths of the selected unique subpatterns as described above. Replacing a selected subpattern may include selecting the next-longest or the next-highest-priority subpattern identified for the given pattern. For example, potential subpatterns may be prioritized based on their likelihood of causing a DFA explosion, or on the expected magnitude of a DFA explosion.
根据在此披露的实施例,至少一种启发法可以包括标识每个模式的子模式和如果给定子模式具有一个小于最小阈值的长度则忽视每个模式的所标识的子模式中的给定子模式。例如,为了减少至少一个NFA中的误报,编译器可以忽视具有小于最小阈值的长度的子模式,因为此类子模式会引起至少一个NFA中的误报的更高概率。According to embodiments disclosed herein, at least one heuristic may include identifying subpatterns of each pattern and ignoring a given subpattern of the identified subpatterns of each pattern if the given subpattern has a length less than a minimum threshold. For example, to reduce false positives in at least one NFA, the compiler may ignore subpatterns having a length less than the minimum threshold because such subpatterns may cause a higher probability of false positives in the at least one NFA.
该至少一种启发法可以包括访问将子模式与历史使用频率指示符相关联的知识库(未示出),并且如果所访问的知识库中针对给定子模式的历史使用频率指示符大于等于使用频率阈值,则忽视每个模式的所标识的子模式中的该给定子模式。例如,特定用途或协议特定的子模式可能具有高使用频率,如针对超文本传输协议(HTTP)有效载荷、“回车换行”、或零流(如来自二进制文件的多个连续0)、或任何其他频繁使用的子模式。The at least one heuristic may include accessing a knowledge base (not shown) that associates subpatterns with historical frequency-of-use indicators, and disregarding a given subpattern of the identified subpatterns of each pattern if the historical frequency-of-use indicator for the given subpattern in the accessed knowledge base is greater than or equal to a frequency-of-use threshold. For example, particular usage- or protocol-specific subpatterns may have a high frequency of use, such as in Hypertext Transfer Protocol (HTTP) payloads, a "carriage return line feed," a stream of zeroes (e.g., multiple consecutive zeroes from a binary file), or any other frequently used subpattern.
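The two rejection heuristics above (minimum length, and knowledge-base frequency of use) can be sketched together as a single filter. The threshold values and knowledge-base entries below are invented for illustration:

```python
MIN_LENGTH = 3         # illustrative minimum subpattern length
FREQ_THRESHOLD = 0.01  # illustrative frequency-of-use cutoff

# Invented knowledge-base entries: subpattern -> historical frequency of use
KNOWLEDGE_BASE = {"\r\n": 0.9, "\x00\x00\x00": 0.5, "GET ": 0.2}

def viable(subpattern):
    """Reject candidates likely to yield NFA false positives: too short,
    or too common in typical traffic per the knowledge base."""
    if len(subpattern) < MIN_LENGTH:
        return False
    if KNOWLEDGE_BASE.get(subpattern, 0.0) >= FREQ_THRESHOLD:
        return False
    return True
```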
该至少一种启发法可以包括标识每个模式的子模式,并通过从所标识的子模式中基于给定子模式具有最大数量的连续文本字符、并且基于给定子模式在为一个或多个正则表达式集合选择的所有子模式之间为唯一的来选择该给定子模式,使所选择的子模式中的连续文本字符的数量最大化。如以上所披露的,使所选择的子模式的长度最大化可以能够实现至少一个NFA中匹配的更高概率。The at least one heuristic may include identifying subpatterns of each pattern and maximizing the number of consecutive text characters in each selected subpattern by selecting, from the identified subpatterns, a given subpattern based on the given subpattern having the greatest number of consecutive text characters and based on the given subpattern being unique among all subpatterns selected for the one or more regular expression sets. As disclosed above, maximizing the length of the selected subpattern can enable a higher probability of a match in the at least one NFA.
该至少一种启发法可以包括基于给定子模式中的每个子模式的子模式类型和给定子模式的长度来确定每个模式的给定子模式的优先次序。子模式类型可以是纯文本、交替、单字符重复、或多字符重复,并且子模式类型的从最高到最低的优先次序可以是纯文本、交替、单字符重复、以及多字符重复。如此,可以将为具有至少最小长度阈值的长度的文本字符串的子模式的优先次序确定为比可变长度的复杂子模式更高。The at least one heuristic may include prioritizing the given subpatterns of each pattern based on a subpattern type of each of the given subpatterns and on a length of each of the given subpatterns. The subpattern type may be plain text, alternation, single-character repetition, or multi-character repetition, and the priority of subpattern types, from highest to lowest, may be plain text, alternation, single-character repetition, and multi-character repetition. As such, a subpattern that is a text string having a length of at least a minimum length threshold may be prioritized higher than a variable-length complex subpattern.
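This type-then-length prioritization can be expressed as a simple sort key (a sketch only; the type names and numeric ranks are illustrative):

```python
# Priority order from the text: plain text highest, then alternation,
# then single-character repetition, then multi-character repetition
# (the numeric ranks are illustrative).
TYPE_PRIORITY = {"text": 3, "alternation": 2,
                 "single_char_repetition": 1, "multi_char_repetition": 0}

def priority_key(subpattern_type, length):
    """Sort key for candidate subpatterns: type dominates, and length
    breaks ties within a type."""
    return (TYPE_PRIORITY[subpattern_type], length)

# A short fixed-length text candidate outranks a longer variable-length one:
candidates = [("alternation", 10), ("text", 4), ("text", 2)]
best = max(candidates, key=lambda c: priority_key(*c))
```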
编译器306可以使较长长度的子模式的优先次序比较短长度的另一个子模式高。编译器306可以基于确定优先次序来选择唯一子模式作为所选择的子模式。如上所述,所选择的唯一子模式可以具有至少最小长度阈值的长度。The compiler 306 can prioritize a subpattern of a longer length over another subpattern of a shorter length. The compiler 306 can select a unique subpattern as the selected subpattern based on the prioritization. As described above, the selected unique subpattern can have a length of at least the minimum length threshold.
如果给定子模式中没有一个是唯一的并且具有至少最小长度阈值的长度,则编译器306可以基于确定优先次序来选择非唯一子模式作为所选择的子模式。如此,编译器306可以从模式中选择一个是从另一个模式中选择的子模式的副本的子模式,而不是选择具有小于最小阈值的长度的子模式。为了方便子模式的最终确定,编译器306可以多次遍历这些模式并且按长度对可能的子模式进行分类。如此,可以在针对一个或多个正则表达式的集合304中的其他模式的子模式选择的上下文中执行针对一个或多个正则表达式的集合304中的给定模式的编译器子模式选择。If none of the given subpatterns is both unique and of a length that is at least the minimum length threshold, the compiler 306 may select a non-unique subpattern as the selected subpattern based on the prioritization. As such, the compiler 306 may select, from a pattern, a subpattern that is a duplicate of a subpattern selected from another pattern, rather than select a subpattern having a length less than the minimum threshold. To facilitate finalizing the subpatterns, the compiler 306 may make multiple passes over the patterns and sort the possible subpatterns by length. As such, the compiler's subpattern selection for a given pattern in the set 304 of one or more regular expressions may be performed in the context of the subpattern selections for the other patterns in the set 304 of one or more regular expressions.
如上所述,限定符322可以指示希望报告起始偏移。然而,起始偏移可能不是容易可辨别的。例如,鉴于如“axycamb”的有效载荷,在该有效载荷中匹配如“a.*b”或“a.*d”的模式时发现起始偏移会是困难的,因为两个匹配都是可能的:“axycamb”和“amb”。如此,可能需要将有效载荷中的“a”的两个实例的偏移作为潜在起始偏移进行跟踪。根据在此披露的实施例,不需要跟踪潜在起始偏移,因为直到确定已经在有效载荷中发现整个模式的匹配才确定起始偏移。确定已经发现整个模式的匹配可以基于由统一DFA、至少一个NFA、或其组合产生的匹配。As described above, the qualifier 322 may indicate that reporting of a start offset is desired. However, the start offset may not be readily discernible. For example, given a payload such as "axycamb", finding the start offset when matching a pattern such as "a.*b" or "a.*d" in the payload may be difficult because two matches are possible: "axycamb" and "amb". As such, the offsets of both instances of "a" in the payload might otherwise need to be tracked as potential start offsets. According to embodiments disclosed herein, potential start offsets need not be tracked, because the start offset is not determined until it is determined that a match of the entire pattern has been found in the payload. The determination that a match of the entire pattern has been found may be based on matches produced by the unified DFA, the at least one NFA, or a combination thereof.
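The start-offset ambiguity can be reproduced with an ordinary regular-expression engine: a zero-width lookahead finds both overlapping matches of "a.*b" in "axycamb", each with a different candidate start offset (Python's `re` module is used here purely for illustration, not as part of the described architecture):

```python
import re

payload = "axycamb"
# A zero-width lookahead reports every position where "a.*b" could
# begin, exposing the overlapping matches a plain search would hide.
matches = [m.group(1) for m in re.finditer(r"(?=(a.*b))", payload)]
starts = [m.start() for m in re.finditer(r"(?=(a.*b))", payload)]
```

Both candidate start offsets (0 and 4) exist until the end of the match is known, which is why deferring the start-offset determination avoids the tracking described above.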
根据在此披露的实施例,如果所接收到的数据包101中的有效载荷包括与从模式316中选择的子模式318匹配的内容,则行走器可以转换以行走用于模式316的至少一个NFA。行走器320可以报告所选择的子模式318的匹配和对匹配子模式的最后一个字符在所接收到的数据包中的位置进行标识的偏移,作为有效载荷中子模式的结束偏移。如果子模式是模式的子集,则子模式匹配可以是模式的部分匹配。如此,行走器320可以通过为该模式行走至少一个NFA来继续搜索有效载荷中的模式的剩余部分,以便确定模式的最终匹配。应理解到,模式可以遍历所接收到的数据包101a中的一个或多个有效载荷。According to embodiments disclosed herein, if the payload in a received data packet 101 includes content matching the subpattern 318 selected from the pattern 316, the walker may transition to walking the at least one NFA for the pattern 316. The walker 320 may report the match of the selected subpattern 318 and an offset, identifying the position in the received data packet of the last character of the matching subpattern, as the end offset of the subpattern in the payload. If the subpattern is a subset of the pattern, the subpattern match may be a partial match of the pattern. As such, the walker 320 may continue searching the payload for the remainder of the pattern by walking the at least one NFA for the pattern, in order to determine a final match of the pattern. It should be understood that a pattern may span one or more payloads of received data packets 101a.
图3B为可以在至少一个处理器中实施的方法的示例实施例的流程图(350),该处理器操作性地耦合至安全装置内的至少一个存储器,该安全装置操作性地耦合至网络。该方法可以开始(352)并基于至少一种启发法来从一个或多个正则表达式模式的集合中的每个模式中选择子模式(354)。该方法可以使用从该集合中的所有模式中选择的子模式来生成统一确定有限自动机(DFA)(356)。该方法可以为该集合中的至少一个模式生成至少一个非确定有限自动机(NFA),基于所选择的子模式的长度是否是固定的或可变的和所选择的子模式在至少一个模式内的位置来确定至少一个模式的用于生成至少一个NFA的部分和至少一个NFA的运行时间处理的至少一个行走方向(358)。该方法可以将统一DFA和所生成的至少一个NFA存储在至少一个存储器内(360)。之后,在该示例实施例中,该方法结束(362)。FIG3B is a flowchart (350) of an example embodiment of a method that can be implemented in at least one processor operatively coupled to at least one memory within a security device operatively coupled to a network. The method can begin (352) and select a subpattern from each pattern in a set of one or more regular expression patterns based on at least one heuristic (354). The method can generate a unified deterministic finite automaton (DFA) (356) using the subpatterns selected from all patterns in the set. The method can generate at least one nondeterministic finite automaton (NFA) for at least one pattern in the set, and determine a portion of the at least one pattern for generating at least one NFA and at least one direction of run-time processing of the at least one NFA based on whether the length of the selected subpattern is fixed or variable and the position of the selected subpattern within the at least one pattern (358). The method can store the unified DFA and the generated at least one NFA in at least one memory (360). Thereafter, in this example embodiment, the method ends (362).
如以上所披露的,编译器306可以生成统一DFA 312和至少一个NFA 314以使行走器320能够搜索所接收到的数据包101a中的一个或多个正则表达式模式304的匹配。编译器306可以基于至少一种启发法来从一个或多个正则表达式模式的集合304中的每个模式中选择子模式。可以使用从集合304中的所有模式中选择的子模式302来生成统一DFA 312。编译器306可以为集合304中的至少一个模式316生成至少一个NFA 314。如此,编译器306可以被配置成用于将规则集310编译成对来自规则集310的可能最适用于DFA或NFA处理的部分进行标识的二值图像112。因此,二值图像112可以包括至少两个部分,其中一个用于DFA处理的第一部分和一个用于NFA处理的第二部分,如统一DFA 312和至少一个NFA 314。As disclosed above, compiler 306 can generate unified DFA 312 and at least one NFA 314 to enable walker 320 to search for matches to one or more regular expression patterns 304 in received data packet 101a. Compiler 306 can select subpatterns from each pattern in set 304 of one or more regular expression patterns based on at least one heuristic. Unified DFA 312 can be generated using subpatterns 302 selected from all patterns in set 304. Compiler 306 can generate at least one NFA 314 for at least one pattern 316 in set 304. Thus, compiler 306 can be configured to compile rule set 310 into binary image 112 that identifies portions of rule set 310 that may be most suitable for DFA or NFA processing. Thus, binary image 112 can include at least two portions, a first portion for DFA processing and a second portion for NFA processing, such as unified DFA 312 and at least one NFA 314.
如以上所披露的,二值图像112可以包括用于DFA和NFA两者的编译规则数据,或可以是将DFA编译规则数据与NFA编译规则数据分开的多张二值图像。例如,NFA编译规则可以与DFA编译规则分开并且存储在操作性地耦合至至少一个HNA处理器108的图形存储器内。存储器104可以是图形存储器,并且可以是多个存储器,如以上参照图1D和图1E所披露的超级集群图形存储器156a、HNA片上图形存储器156b、以及HNA芯片外图形存储器156c。As disclosed above, the binary image 112 may include compiled rule data for both the DFA and the NFA, or may be multiple binary images separating the DFA compiled rule data from the NFA compiled rule data. For example, the NFA compiled rules may be separate from the DFA compiled rules and stored in a graph memory operatively coupled to the at least one HNA processor 108. The memory 104 may be a graph memory, and may be multiple memories, such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, and the HNA off-chip graph memory 156c disclosed above with reference to FIG1D and FIG1E.
如以上所披露的,HNA处理器108和HFA处理器110可以是被配置成用于从执行计算和内存密集型模式匹配方法的沉重负担分流网络服务处理器100通用内核(如以上披露的至少一个CPU内核103)的协处理器。如此,HFA处理器110可以被配置成用于实施行走器320的与DFA处理相关的功能性,并且至少一个HNA处理器108可以被配置成用于实施行走器320的与NFA处理相关的功能性。如以上所披露的,至少一个HNA处理器108可以包括多个超级集群。每个超级集群可以包括多个集群。该多个集群中的每个集群可以包括多个HNA处理单元(HPU)。As disclosed above, the HNA processor 108 and the HFA processor 110 can be coprocessors configured to offload the heavy burden of executing computationally and memory-intensive pattern matching methods from the general-purpose cores of the network services processor 100 (such as the at least one CPU core 103 disclosed above). Thus, the HFA processor 110 can be configured to implement the functionality of the walker 320 related to DFA processing, and the at least one HNA processor 108 can be configured to implement the functionality of the walker 320 related to NFA processing. As disclosed above, the at least one HNA processor 108 can include multiple superclusters. Each supercluster can include multiple clusters. Each of the multiple clusters can include multiple HNA processing units (HPUs).
图4A为HNA处理单元(HPU)425的示例实施例的框图。根据在此披露的实施例,可以从HNA片上指令队列154给HPU 425分配至少一个HNA指令153。至少一个HNA指令153可以包括至少一项HNA工作(未示出),该HNA工作可以基于图1G的HFA处理器110针对图3A的子模式302中的在输入流中匹配的给定子模式所标识的部分匹配结果来确定。FIG4A is a block diagram of an example embodiment of an HNA processing unit (HPU) 425. According to embodiments disclosed herein, at least one HNA instruction 153 may be assigned to the HPU 425 from the HNA on-chip instruction queue 154. The at least one HNA instruction 153 may include at least one HNA task (not shown), which may be determined based on a partial match result identified by the HFA processor 110 of FIG1G for a given sub-pattern in the sub-pattern 302 of FIG3A that is matched in the input stream.
根据该示例实施例,HPU 425可以包括HNA处理内核408。HNA处理内核408可以操作性地耦合至以下参照图7B、图12、以及图13A至图13D进一步披露的节点高速缓存451。HNA处理内核408可以操作性地耦合至字符类高速缓存454、有效载荷缓冲区462、栈顶寄存器470、和运行堆栈460、以及可以被配置成为统一存储器的匹配结果缓冲区466和保存缓冲区464。HNA处理内核408可以被配置成用于行走至少一个每模式NFA,其中,有效载荷段存储在有效载荷缓冲区462内,以确定至少一个正则表达式模式的匹配。如此,多个超级集群121a和121b的多个集群123a-d的多个HPU 125a-f中的每个HPU可以进一步包括HNA处理内核408,该处理内核操作性地耦合至节点高速缓存451、字符类高速缓存454、有效载荷缓冲区462、栈顶寄存器470、和运行堆栈460、以及可以被配置成为统一存储器的匹配结果缓冲区466和保存缓冲区464。运行堆栈460、保存缓冲区464和结果写入缓冲区466可以包括ECC保护(单个错误校正/双错误检测)。According to this example embodiment, the HPU 425 may include an HNA processing core 408. The HNA processing core 408 may be operatively coupled to a node cache 451, which is further disclosed below with reference to FIG. 7B , FIG. 12 , and FIG. 13A to FIG. 13D . The HNA processing core 408 may be operatively coupled to a character class cache 454, a payload buffer 462, a top-of-stack register 470, and a run stack 460, as well as a match result buffer 466 and a save buffer 464, which may be configured as unified memory. The HNA processing core 408 may be configured to walk at least one per-pattern NFA, wherein payload segments are stored in the payload buffer 462, to determine matches for at least one regular expression pattern. Thus, each HPU in the plurality of HPUs 125a-f of the plurality of clusters 123a-d of the plurality of superclusters 121a and 121b may further include an HNA processing core 408 operatively coupled to a node cache 451, a character class cache 454, a payload buffer 462, a top-of-stack register 470, and a run stack 460, as well as a match result buffer 466 and a save buffer 464 that may be configured as a unified memory. The run stack 460, the save buffer 464, and the result write buffer 466 may include ECC protection (single error correction/double error detection).
该多个超级集群121a和121b的该多个集群123a-d的该多个HPU 125a-f中的每个HPU可以包括节点高速缓存451,该节点高速缓存可以被配置成用于高速缓存来自如以下参照图7B所披露的超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c的一个或多个节点。该多个超级集群121a和121b的该多个集群123a-d的该多个HPU 125a-f中的每个HPU可以进一步包括字符类高速缓存454,该字符类高速缓存可以被配置成用于高速缓存来自超级集群字符类存储器135的一个或多个正则表达式模式字符类定义。该多个超级集群121a和121b的该多个集群123a-d的该多个HPU 125a-f中的每个HPU可以进一步包括有效载荷缓冲区462,该有效载荷缓冲区可以被配置成用于存储来自HNA数据包数据存储器165的给定有效载荷。来自HNA片上指令队列154的至少一个HNA指令153可以包括一个用于HNA数据包数据存储器165中的给定有效载荷中的位置的标识符。该多个超级集群121a和121b的该多个集群123a-d的该多个HPU 125a-f中的每个HPU可以进一步包括栈顶寄存器470,该栈顶寄存器可以被配置成用于存储单项HNA工作。运行堆栈460可以被配置成用于存储多项HNA工作,并且该统一存储器可以被配置成用于存储保存堆栈464的第一内容和匹配结果缓冲区466的第二内容。该第一内容可以包括运行堆栈460中所存储的一项或多项HNA工作,并且该第二内容可以包括一个或多个最终匹配结果。HNA工作在此还可以可互换地被称为上下文或未探究的上下文。Each of the plurality of HPUs 125a-f of the plurality of clusters 123a-d of the plurality of superclusters 121a and 121b may include a node cache 451 that may be configured to cache one or more nodes from the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c as disclosed below with reference to FIG7B . Each of the plurality of HPUs 125a-f of the plurality of clusters 123a-d of the plurality of superclusters 121a and 121b may further include a character class cache 454 that may be configured to cache one or more regular expression pattern character class definitions from the supercluster character class memory 135. Each of the plurality of HPUs 125a-f in the plurality of clusters 123a-d in the plurality of superclusters 121a and 121b may further include a payload buffer 462 configured to store a given payload from the HNA packet data memory 165. At least one HNA instruction 153 from the HNA on-chip instruction queue 154 may include an identifier for a location within the given payload in the HNA packet data memory 165. Each of the plurality of HPUs 125a-f in the plurality of clusters 123a-d in the plurality of superclusters 121a and 121b may further include a top-of-stack register 470 configured to store a single HNA job. 
The run stack 460 may be configured to store multiple HNA jobs, and the unified memory may be configured to store a first content of a save stack 464 and a second content of a match result buffer 466. The first content may include one or more HNA jobs stored in the run stack 460, and the second content may include one or more final match results. HNA work may also be referred to herein interchangeably as context or unexplored context.
该至少一项HNA工作中的给定HNA工作可以指示至少一个NFA 314中的给定NFA、该给定NFA的至少一个给定节点、给定有效载荷中的至少一个给定偏移、以及至少一个行走方向,每个至少一个行走方向与该至少一个给定节点中的一个节点相对应。每项至少一项HNA工作可以包括由HFA处理器110进行的处理的结果,从而使至少一个HNA处理器108能够将用于至少一个模式304中的给定模式的给定NFA中的与给定子模式的匹配提前。如此,每项HNA工作表示HFA协处理器110所确定的部分匹配结果,以便通过至少一个HNA处理器108经由所分配的HPU 425将给定模式的匹配提前。所分配的HPU可以包括HNA处理内核408。A given HNA job in the at least one HNA job may indicate a given NFA in the at least one NFA 314, at least one given node in the given NFA, at least one given offset in a given payload, and at least one walking direction, each at least one walking direction corresponding to one of the at least one given nodes. Each of the at least one HNA jobs may include the results of processing performed by the HFA processor 110, thereby enabling the at least one HNA processor 108 to advance a match for a given subpattern in a given NFA for a given pattern in the at least one pattern 304. Thus, each HNA job represents a partial match result determined by the HFA coprocessor 110, thereby advancing a match for a given pattern by the at least one HNA processor 108 via the assigned HPU 425. The assigned HPU may include an HNA processing core 408.
HNA处理内核408可以对至少一个HNA指令153进行以下处理:读取至少一个指针(未示出)、或其中存储的其他合适的指令信息。该至少一个指针可以包括一个指向至少一个系统存储器151的输入堆栈分区161内的输入缓冲区458的输入缓冲区指针(未示出)。至少一个HNA指令153还可以包括一个指向至少一个系统存储器151的HNA数据包数据存储器165中所存储的有效载荷(未示出)的有效载荷指针(未示出),并且可以将该有效载荷提取到HPU 425的有效载荷缓冲区462。至少一个HNA指令153可以进一步包括一个指向HNA芯片外结果缓冲区分区171中的给定结果缓冲区的结果缓冲区指针(未示出)以使HPU 425的HNA处理内核408能够传输HPU 425的匹配结果缓冲区466中所存储的至少一个匹配结果条目。至少一个HNA指令153可以进一步包括一个指向至少一个系统存储器151的HNA芯片外保存缓冲区分区171中的给定保存缓冲区的保存缓冲区指针(未示出)以使HNA处理内核408能够从HPU 425的保存缓冲区464传输至少一个保存缓冲区条目。The HNA processing core 408 may process the at least one HNA instruction 153 by reading at least one pointer (not shown), or other suitable instruction information, stored therein. The at least one pointer may include an input buffer pointer (not shown) pointing to an input buffer 458 within the input stack partition 161 of the at least one system memory 151. The at least one HNA instruction 153 may also include a payload pointer (not shown) pointing to a payload (not shown) stored in the HNA packet data memory 165 of the at least one system memory 151, and the payload may be fetched into the payload buffer 462 of the HPU 425. The at least one HNA instruction 153 may further include a result buffer pointer (not shown) pointing to a given result buffer in the HNA off-chip result buffer partition 171, to enable the HNA processing core 408 of the HPU 425 to transfer at least one match result entry stored in the match result buffer 466 of the HPU 425. The at least one HNA instruction 153 may further include a save buffer pointer (not shown) pointing to a given save buffer in the HNA off-chip save buffer partition 171 of the at least one system memory 151, to enable the HNA processing core 408 to transfer at least one save buffer entry from the save buffer 464 of the HPU 425.
The at least one HNA instruction 153 may further include a run stack pointer (not shown) pointing to a given run stack in the HNA off-chip run stack partition 167 of the at least one system memory 151, to enable the HNA processing core 408 to transfer at least one run stack entry from, or to, the run stack 460 of the HPU 425.
尽管输入缓冲区458、运行堆栈460、和保存缓冲区464可能展现或可能不展现出堆栈的后进先出(LIFO)特性,但输入缓冲区458、运行堆栈460、和保存缓冲区464在此可以分别被称为输入堆栈、运行堆栈、和保存堆栈。输入缓冲区458、运行堆栈460、和保存缓冲区464可以位于同一或不同物理缓冲区内。如果位于同一物理缓冲区内,输入堆栈458、运行堆栈460、和保存堆栈464的条目基于条目的字段设置可以不同,或以任何其他合适的方式而不同。输入堆栈458和运行堆栈460可以位于可以是片上的同一物理缓冲区内,并且保存堆栈464可以位于可以是芯片外的另一个物理缓冲区内。Although input buffer 458, run stack 460, and save buffer 464 may or may not exhibit the last-in, first-out (LIFO) characteristic of a stack, input buffer 458, run stack 460, and save buffer 464 may be referred to as an input stack, a run stack, and a save stack, respectively, herein. Input buffer 458, run stack 460, and save buffer 464 may be located in the same or different physical buffers. If located in the same physical buffer, the entries of input stack 458, run stack 460, and save stack 464 may be different based on the field settings of the entries, or in any other suitable manner. Input stack 458 and run stack 460 may be located in the same physical buffer, which may be on-chip, and save stack 464 may be located in another physical buffer, which may be off-chip.
至少一个HNA指令153的至少一项HNA工作可以存储在输入堆栈458内以便由HNA处理内核408进行处理。该至少一个HNA指令的该至少一项HNA工作每个可以属于同一给定有效载荷,如被HFA处理器110处理并被传输至有效载荷缓冲区462的有效载荷。The at least one HNA job of the at least one HNA instruction 153 may be stored in the input stack 458 for processing by the HNA processing core 408. The at least one HNA job of the at least one HNA instruction may each belong to the same given payload, such as a payload processed by the HFA processor 110 and transferred to the payload buffer 462.
HNA处理内核408可以被配置成用于基于输入缓冲区指针从输入缓冲区458加载(即,提取或取回)至少一项HNA工作。HNA处理内核408可以将该至少一项HNA工作推送(即,存储)至运行堆栈460。HNA处理内核408可以从运行堆栈460弹出(即,读取、提取、加载等)给定HNA工作并对该给定HNA工作进行处理。每项至少一项HNA工作可以包括对有效载荷缓冲区462中所存储的有效载荷的段(未示出)的有效载荷偏移、和指向可以是至少一个有限自动机(如图3A的至少一个NFA 314)中的至少一个有限自动机的图形(未示出)的指针。The HNA processing core 408 can be configured to load (i.e., extract or retrieve) at least one HNA job from the input buffer 458 based on the input buffer pointer. The HNA processing core 408 can push (i.e., store) the at least one HNA job to the run stack 460. The HNA processing core 408 can pop (i.e., read, extract, load, etc.) a given HNA job from the run stack 460 and process the given HNA job. Each of the at least one HNA job can include a payload offset to a segment (not shown) of a payload stored in the payload buffer 462 and a pointer to a graph (not shown), which can be at least one finite automaton (e.g., the at least one NFA 314 of FIG. 3A ).
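The job flow described above (load HNA jobs from the input buffer, push them onto the run stack, then pop and process until none remain) might be modeled as follows. This is a toy sketch; the function names and job representation are assumptions, not the hardware interface:

```python
def process_instruction(input_jobs, walk_job):
    """Toy model of the HPU job flow: jobs from the input buffer are
    pushed onto the run stack, then popped and processed one at a time.
    walk_job receives the run stack so it can push further unexplored
    contexts back onto it, and returns any match results it produced."""
    run_stack = list(input_jobs)   # push all jobs for this instruction
    results = []
    while run_stack:               # process until the run stack is empty
        job = run_stack.pop()      # pop (i.e., load) a given HNA job
        results.extend(walk_job(job, run_stack))
    return results
```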
HNA处理内核408可以加载(即,提取)可能将节点分布在超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c中的任何一个或多个之中的图形,并且可以开始使用与有效载荷缓冲区462中的有效载荷的对应有效载荷偏移相对应的有效载荷段来遍历所提取的节点。图形的部分匹配路径可以包括图形的使有效载荷的连续段与用于生成该图形的给定模式匹配的至少两个节点。该部分匹配路径在此可以被称为线程或活动线程。The HNA processing core 408 may load (i.e., fetch) a graph whose nodes may be distributed among any one or more of the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c, and may begin traversing the fetched nodes using the payload segments corresponding to respective payload offsets of the payload in the payload buffer 462. A partial match path of the graph may include at least two nodes of the graph that match consecutive segments of the payload against the given pattern used to generate the graph. The partial match path may be referred to herein as a thread or an active thread.
随着HNA处理内核408使用来自有效载荷缓冲区462的有效载荷段来对图形进行处理,其可以将条目推送至运行堆栈460或从该运行堆栈弹出条目,以保存和恢复其在图形中的位置。例如,如果行走过的节点为有待行走的下一个节点呈现多个选项,则HNA处理内核408可能需要将其位置保存在图形内。例如,HNA处理内核408可以行走呈现多个处理路径选项的节点,如同图形所表示的岔路。根据在此披露的实施例,DFA或NFA的节点可以与节点类型相关联。与分离类型相关联的节点可以呈现多个处理路径选项。以下参照图5A进一步披露了分离节点类型。As the HNA processing core 408 processes the graph using payload segments from the payload buffer 462, it may push entries onto, and pop entries from, the run stack 460 to save and restore its position in the graph. For example, the HNA processing core 408 may need to save its position within the graph if a walked node presents multiple options for a next node to be walked. For example, the HNA processing core 408 may walk a node that presents multiple processing path options, analogous to a fork in a road represented by the graph. According to embodiments disclosed herein, the nodes of a DFA or NFA may be associated with node types. A node associated with a split type may present multiple processing path options. The split node type is further disclosed below with reference to FIG5A.
根据在此披露的实施例,HNA处理内核408可以被配置成用于选择该多条处理路径中的给定路径,并将条目推送至运行堆栈460,基于确定沿着所选择的路径行走过的节点处的失配(即,否定)结果,该运行堆栈可以使HNA处理内核408能够沿着该多条处理路径中的未选择的路径返回和继续进行。如此,将条目推送至运行堆栈460可以保存图形中表示未探究的上下文的位置。未探究的上下文可以指示图形的给定节点和相应的有效载荷偏移,以使HNA处理内核408能够返回至给定节点并根据来自有效载荷缓冲区462的有效载荷的给定段来行走给定节点,因为该给定段可以位于有效载荷中的相应有效载荷偏移处。如此,运行堆栈460可以用于使HNA处理内核408能够记住和稍后行走图形的未探究的路径。推送或存储指示给定节点和给定有效载荷中的相应偏移的条目在此可以被称为存储未探究的上下文、线程或非活动线程。弹出、提取、或加载指示给定节点和给定有效载荷中的相应偏移的条目以便根据位于给定有效载荷中的相应偏移处的段来行走给定节点在此可以被称为激活线程。According to embodiments disclosed herein, the HNA processing core 408 may be configured to select a given path of the multiple processing paths and push an entry onto the run stack 460; based on determining a mismatch (i.e., negative) result at a node walked along the selected path, the run stack may enable the HNA processing core 408 to return and continue along an unselected path of the multiple processing paths. As such, pushing an entry onto the run stack 460 may save a position in the graph that represents unexplored context. The unexplored context may indicate a given node of the graph and a corresponding payload offset, to enable the HNA processing core 408 to return to the given node and walk the given node with a given segment of the payload from the payload buffer 462, as the given segment may be located at the corresponding payload offset within the payload. As such, the run stack 460 may be used to enable the HNA processing core 408 to remember, and later walk, unexplored paths of the graph. Pushing or storing an entry indicating a given node and a corresponding offset in a given payload may be referred to herein as storing unexplored context, a thread, or an inactive thread. Popping, fetching, or loading an entry indicating a given node and a corresponding offset in a given payload, in order to walk the given node with the segment located at the corresponding offset in the given payload, may be referred to herein as activating a thread.
Discarding an entry indicating a given node and a corresponding offset in a given payload may be referred to herein as flushing an entry or deactivating a thread.
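The push/pop discipline described above is essentially a backtracking walk with an explicit stack of unexplored contexts. The following toy walker sketches the idea for a pattern like "a.*b"; the node encoding is invented for illustration and is not the patent's graph format:

```python
# Toy node encodings (illustrative only):
#   ("char", c, next): consume one byte equal to c, go to node `next`
#   ("any", next):     consume any one byte (used with "split" to build ".*")
#   ("split", a, b):   fork: walk branch a now, save branch b as
#                      unexplored context on the run stack
#   ("match",):        the whole pattern matched
def walk(nfa, start, payload):
    """Return the offset one past the match, or None on mismatch."""
    run_stack = [(start, 0)]                 # initial context: (node, offset)
    while run_stack:
        node_id, offset = run_stack.pop()    # activate a thread
        while True:
            node = nfa[node_id]
            kind = node[0]
            if kind == "match":
                return offset
            if kind == "split":
                run_stack.append((node[2], offset))  # save unexplored path
                node_id = node[1]
            elif kind == "any" and offset < len(payload):
                node_id, offset = node[1], offset + 1
            elif kind == "char" and offset < len(payload) \
                    and payload[offset] == node[1]:
                node_id, offset = node[2], offset + 1
            else:
                break                        # mismatch: pop the next context
    return None                              # run stack empty: no match

# "a.*b": after 'a', either try 'b' or consume any byte and loop.
NFA_A_STAR_B = {0: ("char", "a", 1), 1: ("split", 2, 3),
                2: ("char", "b", 4), 3: ("any", 1), 4: ("match",)}
```

Each `split` saves an entry on the run stack (an inactive thread); a mismatch pops the next saved context instead of failing outright, mirroring the save-and-return behavior described above.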
当以图形行走有效载荷段时,在到达有效载荷缓冲区462中的有效载荷的边界的情况下,保存缓冲区464可以使HNA处理内核408能够将其位置保存在图形中。例如,HNA处理内核408可以确定有效载荷缓冲区462中的有效载荷或有效载荷的一部分与给定模式部分匹配,并且确定有效载荷的当前有效载荷偏移为有效载荷的结束偏移。如此,HNA处理内核408可以确定仅发现了给定模式的部分匹配并且已对整个有效载荷进行了处理。如此,HNA处理内核408可以将运行堆栈460的内容保存到保存缓冲区464,以便继续行走与所处理的有效载荷属于相同的流的下一个有效载荷。保存缓冲区464可以被配置成用于存储运行堆栈460的至少一个运行堆栈条目,从而在处理整个有效载荷的情况下,反映运行堆栈460的运行状态。When walking payload segments through the graph, the save buffer 464 may enable the HNA processing core 408 to save its position in the graph upon reaching the boundary of the payload in the payload buffer 462. For example, the HNA processing core 408 may determine that the payload, or a portion of the payload, in the payload buffer 462 partially matches a given pattern, and may determine that the current payload offset of the payload is the end offset of the payload. As such, the HNA processing core 408 may determine that only a partial match of the given pattern has been found and that the entire payload has been processed. The HNA processing core 408 may therefore save the contents of the run stack 460 to the save buffer 464 in order to continue the walk with a next payload corresponding to the same flow as the processed payload. The save buffer 464 may be configured to store at least one run stack entry of the run stack 460, thereby reflecting the run state of the run stack 460 in the event that the entire payload has been processed.
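Saving and restoring the run stack across a payload boundary, as described above, can be modeled as follows (a toy sketch; the class and method names are illustrative, not the hardware design):

```python
class RunStackState:
    """Toy model of saving run-stack contents at a payload boundary so
    a partial match can resume on the next payload of the same flow."""
    def __init__(self):
        self.run_stack = []    # unexplored contexts for the current payload
        self.save_buffer = []  # snapshot carried across payloads

    def on_payload_end(self):
        # Whole payload consumed with only a partial match: save the
        # run state so the walk can continue in the next payload.
        self.save_buffer = list(self.run_stack)
        self.run_stack.clear()

    def on_next_payload(self):
        # Restore the saved contexts before walking the next payload.
        self.run_stack = list(self.save_buffer)
        self.save_buffer.clear()
```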
基于发现模式的最终(即,整个或完整)匹配,HNA处理内核408可以弹出和丢弃运行堆栈460中的与当前HNA工作(例如,从输入缓冲区458加载的HNA工作)相关联的条目并将匹配结果(未示出)保存到匹配结果缓冲区466。可替代地,当可能对所有可能的匹配路径感兴趣时,HNA处理内核408可以继续处理运行堆栈460的与当前HNA工作相关联的条目。Upon finding a final (i.e., entire or complete) match of the pattern, the HNA processing core 408 may pop and discard the entries in the run stack 460 associated with the current HNA job (e.g., the HNA job loaded from the input buffer 458) and may save a match result (not shown) to the match result buffer 466. Alternatively, when all possible match paths may be of interest, the HNA processing core 408 may continue processing the entries of the run stack 460 associated with the current HNA job.
匹配结果可以包括与确定模式的最终匹配所在的节点相关联的节点地址。确定模式的最终匹配所在的节点在此可以被称为标记节点。节点地址、或图形中最终匹配位置的其他标识符、匹配模式的标识符、匹配模式的长度、或任何其他合适的匹配结果或以上的组合可以被包括在匹配结果内。The matching result may include a node address associated with the node where the final match of the determined pattern is located. The node where the final match of the determined pattern is located may be referred to as a marked node herein. The node address, or other identifier of the final matching position in the graph, an identifier of the matching pattern, the length of the matching pattern, or any other suitable matching result or combination thereof may be included in the matching result.
基于处理与当前HNA工作相关联的所有运行堆栈条目,HNA处理内核408可以从运行堆栈460加载之前已经从输入缓冲区458加载过的下一项HNA工作,因为HNA处理内核408可以被配置成用于顺序地处理至少一个HNA指令153的HNA工作。如此,HNA处理内核408可以从超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c提取下一个图形(未示出)并根据来自下一项HNA工作所标识的下一个有效载荷的一个或多个有效载荷段来行走下一个图形,并且继续处理附加HNA工作,直到运行堆栈460为空。Upon processing all run stack entries associated with the current HNA job, the HNA processing core 408 may load, from the run stack 460, a next HNA job that was previously loaded from the input buffer 458, as the HNA processing core 408 may be configured to process the HNA jobs of the at least one HNA instruction 153 sequentially. As such, the HNA processing core 408 may fetch a next graph (not shown) from the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c, walk the next graph with one or more payload segments from the next payload identified by the next HNA job, and continue processing additional HNA jobs until the run stack 460 is empty.
基于当根据有效载荷行走图形时发现有效载荷的失配，HNA处理内核408可以从运行堆栈460弹出与当前HNA工作相关联的条目并基于所弹出的条目的内容根据下一个有效载荷的下一段来行走下一个节点。如果运行堆栈460不包括与当前HNA工作相关联的条目，则HNA处理内核408可以完成当前HNA工作并且可以从运行堆栈460加载之前已经从输入缓冲区458加载过的下一项HNA工作。如此，HNA处理内核408可以被配置成用于基于所加载的下一项HNA工作来行走另外下一个图形，并继续处理附加工作，直到运行堆栈460为空。Upon discovering a payload mismatch while walking the graph based on the payload, the HNA processing core 408 may pop the entry associated with the current HNA job from the run stack 460 and, based on the contents of the popped entry, walk the next node according to the next segment of the next payload. If the run stack 460 does not include an entry associated with the current HNA job, the HNA processing core 408 may complete the current HNA job and load from the run stack 460 the next HNA job previously loaded from the input buffer 458. In this manner, the HNA processing core 408 may be configured to walk yet another next graph based on the loaded next HNA job and continue processing additional jobs until the run stack 460 is empty.
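The sequential job-processing loop described above may be sketched in software as follows; this is a minimal illustrative model under assumed interfaces (the `fetch_graph` and `walk_graph` callables and the job fields are not from the source), not the hardware implementation:

```python
def process_hna_work(jobs, fetch_graph, walk_graph):
    """Illustrative outer loop: HNA jobs previously loaded from the input
    buffer are processed sequentially; for each job the named graph is
    fetched from one of the graph memories and walked against the job's
    payload, until no work remains."""
    match_results = []
    run_stack = list(jobs)            # jobs loaded from the input buffer
    while run_stack:                  # process until the run stack is empty
        job = run_stack.pop(0)        # next HNA job, in order
        graph = fetch_graph(job)      # supercluster / on-chip / off-chip memory
        match_results.extend(walk_graph(graph, job["payload"]))
    return match_results
```

With stub callables this shows only the ordering behavior: each job is completed before the next is started.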
在此披露的实施例可以使用栈顶寄存器470来提高匹配性能。栈顶寄存器470在此可以被可互换地称为补充存储器470、TOS寄存器470或TOS 470。栈顶寄存器470可以是操作性地耦合至一个第二存储器的一个第一存储器，如运行堆栈460。HPU 425的HNA处理内核408可以操作性地耦合至栈顶寄存器470和运行堆栈460。栈顶寄存器470可以被配置成用于存储至少一个有限自动机中的给定有限自动机的多个节点的可以被HNA处理内核408推送用于行走给定节点的HNA工作（即，上下文），如堆栈条目（也被可互换地称为上下文或未探究的上下文）。例如，可以推送或弹出上下文用于行走给定节点。该上下文可以对从网络接收到的输入流的有效载荷中的段的给定节点和偏移进行标识。该上下文可以使HNA处理内核408能够根据该偏移标识的段行走通过该上下文标识的给定节点。Embodiments disclosed herein may utilize a top-of-stack register 470 to improve match performance. The top-of-stack register 470 may be interchangeably referred to herein as a supplemental memory 470, a TOS register 470, or a TOS 470. The top-of-stack register 470 may be a first memory operatively coupled to a second memory, such as the run stack 460. The HNA processing core 408 of the HPU 425 may be operatively coupled to the top-of-stack register 470 and the run stack 460. The top-of-stack register 470 may be configured to store, as stack entries (also interchangeably referred to as contexts or unexplored contexts), HNA work (i.e., contexts) of a plurality of nodes of a given finite automaton of the at least one finite automaton that may be pushed by the HNA processing core 408 for walking a given node. For example, a context may be pushed or popped for walking a given node. The context may identify a given node and an offset of a segment in the payload of an input stream received from the network. The context may enable the HNA processing core 408 to walk the given node identified by the context according to the segment identified by the offset.
栈顶寄存器470可以与可以包括有效性状态信息(也被可互换地称为有效性指示符)的上下文状态信息相关联。有效性状态可以为栈顶寄存器470指示有效或无效状态。该有效状态可以指示栈顶寄存器470存储了未决上下文。未决上下文可以是所存储的还没有被HNA处理内核408处理的上下文。The top-of-stack register 470 may be associated with context state information that may include validity state information (also interchangeably referred to as a validity indicator). The validity state may indicate a valid or invalid state for the top-of-stack register 470. The valid state may indicate that the top-of-stack register 470 stores a pending context. A pending context may be a stored context that has not yet been processed by the HNA processing core 408.
无效状态可以指示栈顶寄存器470没有存储未决上下文,例如,存储到栈顶寄存器470的条目已经被HNA处理内核408弹出以根据段来行走给定节点或以另外的方式被HNA处理内核408丢弃。如此,HNA处理内核408可以使用上下文状态信息来辨别栈顶寄存器470是否具有未决上下文。The invalid state may indicate that the top-of-stack register 470 does not store a pending context, e.g., the entry stored in the top-of-stack register 470 has been popped by the HNA processing core 408 to walk a given node according to the segment or has otherwise been discarded by the HNA processing core 408. Thus, the HNA processing core 408 may use the context state information to discern whether the top-of-stack register 470 has a pending context.
根据在此披露的实施例，有效性状态可以被实现为栈顶寄存器470的位、栈顶寄存器470的多位字段、与栈顶寄存器470分开存储的指示符、或以传达关于栈顶寄存器470是否存储未决上下文的状态的任何其他合适的方式来实现。According to embodiments disclosed herein, the validity state may be implemented as a bit of the top-of-stack register 470, a multi-bit field of the top-of-stack register 470, an indicator stored separately from the top-of-stack register 470, or in any other suitable manner that conveys status regarding whether the top-of-stack register 470 stores a pending context.
HNA处理内核408使用运行堆栈460来保存上下文，如行走NFA图形的节点过程中NFA图形的节点的状态。TOS 470寄存器的访问（即，读取/写入）可以比运行堆栈460快多倍。与ECC保护存储器相比，针对该存储器，推送或弹出操作可以进行三个、四个、或更多个时钟周期，如果在TOS 470寄存器上执行，则推送或弹出操作可以进行一个时钟周期。TOS 470寄存器可以将最近推送的堆栈条目与早前推送的可以通过TOS 470寄存器被推送至运行堆栈460的条目保持分开。将最近推送的条目保持在TOS 470寄存器中可以提高行走性能，因为最近推送的条目可能是最频繁访问的条目，即，在推送另一个条目之前很可能弹出最近推送的条目。The HNA processing core 408 uses the run stack 460 to save context, such as the state of a node of the NFA graph in the course of walking the nodes of the NFA graph. Access (i.e., read/write) of the TOS register 470 may be many times faster than access of the run stack 460. In contrast to an ECC-protected memory, for which a push or pop operation may take three, four, or more clock cycles, a push or pop operation performed on the TOS register 470 may take one clock cycle. The TOS register 470 may keep a most recently pushed stack entry separate from earlier pushed entries, which may be pushed through the TOS register 470 to the run stack 460. Keeping the most recently pushed entry in the TOS register 470 may improve walk performance because the most recently pushed entry is likely to be the most frequently accessed entry, i.e., the most recently pushed entry is likely to be popped before another entry is pushed.
存储上下文（如通过推送一个第一上下文）可以包括基于与TOS 470寄存器相关联的上下文状态信息的存储确定：访问TOS 470寄存器而不访问运行堆栈460，或访问TOS 470寄存器和运行堆栈460。访问TOS 470寄存器而不访问运行堆栈460的存储确定可以基于对TOS 470寄存器的无效状态进行指示的上下文状态信息。访问TOS 470寄存器和运行堆栈460的存储确定可以基于对TOS 470寄存器的有效状态进行指示的上下文状态信息。Storing a context (e.g., by pushing a first context) may include a storage determination, based on context state information associated with the TOS register 470, to access the TOS register 470 without accessing the run stack 460 or to access both the TOS register 470 and the run stack 460. The storage determination to access the TOS register 470 without accessing the run stack 460 may be based on the context state information indicating an invalid state of the TOS register 470. The storage determination to access both the TOS register 470 and the run stack 460 may be based on the context state information indicating a valid state of the TOS register 470.
TOS堆栈470可以配置有用于存储单个上下文（即，HNA工作）的单个条目，并且运行堆栈460可以配置有用于存储多个上下文的多个条目。在HNA处理内核408弹出上下文（即，堆栈条目）例如以取回所存储的上下文的情况下，关于上下文状态信息是否指示TOS 470寄存器的有效或无效状态，可以进行检查。如果上下文状态信息指示有效状态，可以从TOS 470寄存器弹出478最近推送的上下文，并且当TOS 470寄存器不再存储未决上下文时，上下文状态信息可以被更新成用于指示TOS 470寄存器的当前无效状态。The TOS stack 470 may be configured with a single entry for storing a single context (i.e., HNA work), and the run stack 460 may be configured with multiple entries for storing multiple contexts. When the HNA processing core 408 pops a context (i.e., a stack entry), for example, to retrieve a stored context, a check may be performed as to whether the context state information indicates a valid or invalid state of the TOS register 470. If the context state information indicates the valid state, the most recently pushed context may be popped 478 from the TOS register 470, and, as the TOS register 470 no longer stores a pending context, the context state information may be updated to indicate a current invalid state of the TOS register 470.
然而，如果检查确定上下文状态信息指示无效状态，则反而可以从运行堆栈460弹出480（即，取回）未决上下文。如此，基于上下文状态信息的与TOS 470寄存器相关联的无效状态，可以从运行堆栈460取回未决上下文，并且不将运行堆栈460所存储的未决上下文写入到TOS 470寄存器。However, if the check determines that the context state information indicates the invalid state, the pending context may instead be popped 480 (i.e., retrieved) from the run stack 460. As such, based on the context state information indicating the invalid state associated with the TOS register 470, the pending context may be retrieved from the run stack 460 without writing the pending context stored by the run stack 460 to the TOS register 470.
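The push/pop policy of the TOS register 470 and run stack 460 described above may be sketched as follows; the class and method names are illustrative, and the model captures only the ordering behavior, not the clock-cycle costs:

```python
class TosBackedStack:
    """Sketch of a run stack fronted by a single-entry top-of-stack (TOS)
    register with a validity indicator, per the behavior described above.
    Names are illustrative, not taken from the source."""

    def __init__(self):
        self._tos = None          # single-entry TOS register (fast memory)
        self._tos_valid = False   # validity indicator (context state info)
        self._run_stack = []      # multi-entry run stack (slower memory)

    def push(self, context):
        if self._tos_valid:
            # Valid state: TOS already holds a pending context, so spill
            # it to the run stack and keep the newest context in the TOS.
            self._run_stack.append(self._tos)
        # Invalid state: only the TOS register is accessed.
        self._tos = context
        self._tos_valid = True

    def pop(self):
        if self._tos_valid:
            # Fast path: pop from the TOS register and mark it invalid.
            self._tos_valid = False
            return self._tos
        # Invalid state: retrieve from the run stack; the retrieved entry
        # is not written back into the TOS register.
        return self._run_stack.pop()

    def empty(self):
        return not self._tos_valid and not self._run_stack
```

A push with the TOS invalid touches only the fast register; a push with the TOS valid spills the older entry to the run stack, so the most recently pushed entry, which is likely to be popped next, stays in the one-cycle register.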
图4B为根据在此披露的实施例的可以如通过推送或弹出堆栈条目而被存储或取回的上下文4401（即，HNA工作）的示例实施例的框图4400。上下文4401可以包括多个字段4402-4418。该多个字段可以包括可以基于多种节点类型中的一种节点类型的上下文条目类型字段4402。上下文条目类型字段4402可以表示该多个字段4402-4418中的哪些字段可以针对节点类型而相关。FIG. 4B is a block diagram 4400 of an example embodiment of a context 4401 (i.e., HNA work) that may be stored or retrieved, such as by pushing or popping a stack entry, according to embodiments disclosed herein. The context 4401 may include multiple fields 4402-4418. The multiple fields may include a context entry type field 4402 that may be based on one of multiple node types. The context entry type field 4402 may indicate which fields of the multiple fields 4402-4418 are relevant for the node type.
上下文4401可以进一步包括可以基于上下文条目类型字段4402而相关的匹配类型字段4404。匹配类型字段4404可以基于节点类型并且可以用于确定给定节点是否被配置成与从网络接收到的数据流中的给定元素的单个实例或多个连续实例匹配。Context 4401 may further include a match type field 4404 that may be related based on context entry type field 4402. Match type field 4404 may be based on node type and may be used to determine whether a given node is configured to match a single instance or multiple consecutive instances of a given element in a data stream received from the network.
上下文4401可以进一步包括无论上下文条目类型字段4402如何都可以相关的并且可以对用于在给定节点处进行匹配的给定元素进行标识的元素字段4408。Context 4401 may further include an element field 4408 that may be relevant regardless of context entry type field 4402 and may identify a given element for matching at a given node.
上下文4401可以进一步包括无论上下文条目类型字段如何都可以相关的并且可以对与给定节点相关联的下一个节点进行标识的下一个节点地址字段4410。例如,基于给定节点处的肯定匹配,可以通过下一个节点地址字段4410来标识用于行走下一个段的下一个节点。Context 4401 may further include a next node address field 4410 that may be relevant regardless of the context entry type field and may identify the next node associated with a given node. For example, based on a positive match at a given node, the next node for walking the next segment may be identified by next node address field 4410.
上下文4401可以进一步包括可以基于上下文条目类型字段4402而相关的计数字段4412。计数字段4412可以标识剩余的用于与元素字段4408在给定节点处所标识的给定元素进行匹配的连续实例的数量的计数值。Context 4401 may further include a count field 4412 that may be related based on context entry type field 4402. Count field 4412 may identify a count value of the number of consecutive instances remaining for matching a given element identified by element field 4408 at a given node.
上下文4401可以进一步包括无论上下文条目类型字段4402如何都可以相关的并且在输入流中检测到至少一个正则表达式的完整匹配的情况下可以对是否丢弃上下文4401或行走下一个节点地址字段4410所标识的下一个节点进行标识的丢弃未探究的上下文(DUP)字段4414。Context 4401 may further include a discard unexplored context (DUP) field 4414 that may be relevant regardless of the context entry type field 4402 and that may identify whether to discard context 4401 or walk to the next node identified by the next node address field 4410 when a complete match of at least one regular expression is detected in the input stream.
上下文4401可以进一步包括无论上下文条目类型字段4402如何都可以相关的并且可以对逆向或正向行走方向进行标识的逆向行走方向字段4416。Context 4401 may further include a reverse walking direction field 4416 that may be relevant regardless of context entry type field 4402 and may identify reverse or forward walking direction.
上下文4401可以进一步包括无论上下文条目类型字段4402如何都可以相关的并且可以对输入流中的有效载荷段的用于与具体元素进行匹配的偏移进行标识的偏移字段4418。可以基于上下文条目类型字段4402标识该具体元素。Context 4401 may further include an offset field 4418 that may be relevant regardless of context entry type field 4402 and may identify an offset of a payload segment in the input stream for matching with a specific element. The specific element may be identified based on context entry type field 4402.
推送上下文可以包括对包括上下文4401的堆栈条目进行配置，并且该堆栈条目可以存储在如以上披露的图4A的运行堆栈460的堆栈上。上下文4401的字段的第一子集可以基于与给定节点相关联的给定元数据来进行配置、基于之前已经提取给定节点来获得，如匹配类型字段4404、元素字段4408、和下一个节点地址字段4410。上下文4401的字段的第二子集可以由HNA处理内核408基于行走的运行时间信息进行配置，如为给定节点保持的当前行走方向或计数值。例如，该第二子集可以包括逆向行走方向字段4416、计数字段4412、以及丢弃未探究的上下文（DUP）字段4414。Pushing a context may include configuring a stack entry including the context 4401, and the stack entry may be stored on a stack, such as the run stack 460 of FIG. 4A as disclosed above. A first subset of the fields of the context 4401, such as the match type field 4404, the element field 4408, and the next node address field 4410, may be configured based on given metadata associated with a given node, obtained based on having previously fetched the given node. A second subset of the fields of the context 4401 may be configured by the HNA processing core 408 based on runtime information of the walk, such as the current walk direction or a count value maintained for the given node. For example, the second subset may include the reverse walk direction field 4416, the count field 4412, and the discard unexplored context (DUP) field 4414.
HNA处理内核408可以基于上下文条目类型字段4402中所包括的上下文状态设置(未示出)来解释上下文4401。上下文状态设置可以指示上下文4401是否完整或不完整。基于所弹出的堆栈条目的上下文4401的上下文条目类型字段4402的指示上下文4401是否不完整的上下文状态设置,HNA处理内核408可以被配置成用于提取通过下一个节点地址字段4410标识的下一个节点并基于下一个节点所存储的元数据和当前运行时间配置(如行走方向)来继续进行行走,而不是基于所弹出的堆栈条目的上下文4401的字段配置来继续进行行走。The HNA processing core 408 may interpret the context 4401 based on a context state setting (not shown) included in the context entry type field 4402. The context state setting may indicate whether the context 4401 is complete or incomplete. Based on the context state setting of the context entry type field 4402 of the context 4401 of the popped stack entry indicating whether the context 4401 is incomplete, the HNA processing core 408 may be configured to extract the next node identified by the next node address field 4410 and continue walking based on the metadata stored in the next node and the current runtime configuration (e.g., walking direction), rather than continuing walking based on the field configuration of the context 4401 of the popped stack entry.
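The context layout of FIG. 4B and the two field subsets configured at push time may be sketched as follows; the types, widths, and dict keys are assumptions for illustration, not the actual hardware encoding:

```python
from dataclasses import dataclass

@dataclass
class Context:
    """One stack-entry context (cf. FIG. 4B). Field names follow the
    reference numerals 4402-4418; types are illustrative assumptions."""
    entry_type: int       # 4402: based on node type; selects relevant fields
    match_type: int       # 4404: single vs. consecutive-instance matching
    element: str          # 4408: element to match at the given node
    next_node_addr: int   # 4410: next node on a positive match
    count: int            # 4412: consecutive instances remaining to match
    dup: bool             # 4414: discard-unexplored-context indicator
    reverse: bool         # 4416: reverse vs. forward walk direction
    offset: int           # 4418: payload offset of the segment to match

def push_context(run_stack, node_meta, runtime):
    """Configure a stack entry: the first subset of fields comes from the
    node's metadata, the second subset from the walk's runtime state."""
    run_stack.append(Context(
        entry_type=node_meta["type"],
        match_type=node_meta["match_type"],
        element=node_meta["element"],
        next_node_addr=node_meta["next"],
        count=runtime["count"],
        dup=runtime["dup"],
        reverse=runtime["reverse"],
        offset=runtime["offset"],
    ))
```

The split mirrors the text: metadata-derived fields are fixed at compile time per node, while count, direction, and DUP reflect the state of the walk at the moment the entry is pushed.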
图5A为行走器320可以用于对输入流（未示出）中的正则表达式模式502进行匹配的每模式NFA图形504的示例实施例的框图500。如以上所披露的，至少一个HNA处理器108可以被配置成用于实施行走器320的与NFA处理相关的功能性，并且至少一个HNA处理器108可以包括多个超级集群。每个超级集群可以包括多个集群。该多个集群中的每个集群可以包括每个可以包括一个如以上参照图4A所披露的HNA处理内核408的多个HNA处理单元（HPU）。如此，至少一个HPU 425的至少一个HNA处理内核408可以基于HNA调度器129进行的HNA指令的调度来实施行走器320与NFA处理相关的功能性。FIG. 5A is a block diagram 500 of an example embodiment of a per-pattern NFA graph 504 that the walker 320 may use to match a regular expression pattern 502 in an input stream (not shown). As disclosed above, the at least one HNA processor 108 may be configured to implement the NFA-processing-related functionality of the walker 320, and the at least one HNA processor 108 may include multiple superclusters. Each supercluster may include multiple clusters. Each cluster of the multiple clusters may include multiple HNA processing units (HPUs), each of which may include an HNA processing core 408 as disclosed above with reference to FIG. 4A. As such, at least one HNA processing core 408 of at least one HPU 425 may implement the NFA-processing-related functionality of the walker 320 based on the scheduling of HNA instructions by the HNA scheduler 129.
在行走器320可以使用的每模式NFA图形504的示例实施例中，输入流可以包括一个带有有效载荷542的数据包（未示出）。正则表达式模式502为指定字符“h”后跟着无限数量的与换行字符（即，[^\n]*）不匹配的连续字符的模式“h[^\n]*ab”。该无限数量可以是零或更多。模式502进一步包括连续跟在无限数量的与换行字符不匹配的字符后面的字符“a”和“b”。在该示例实施例中，有效载荷542包括有效载荷542中的带有对应偏移520a-d（即，0、1、2、和3）的段522a-d（即，h、x、a、和b）。In an example embodiment of the per-pattern NFA graph 504 that may be used by the walker 320, the input stream may include a data packet (not shown) with a payload 542. The regular expression pattern 502 is the pattern "h[^\n]*ab" that specifies the character "h" followed by an unlimited number of consecutive characters that do not match a newline character (i.e., [^\n]*). The unlimited number may be zero or more. The pattern 502 further includes the characters "a" and "b" consecutively following the unlimited number of characters that do not match the newline character. In this example embodiment, the payload 542 includes segments 522a-d (i.e., h, x, a, and b) with corresponding offsets 520a-d (i.e., 0, 1, 2, and 3) in the payload 542.
应理解到,正则表达式模式502、NFA图形504、有效载荷542、段522a-d、以及偏移520a-d表示用于说明性目的的示例,并且在此披露的系统、方法、以及相应的装置可以适用于任何合适的正则表达式模式、NFA图形、有效载荷、段、以及偏移。进一步地,应理解到,NFA图形504可以是更大的NFA图形(未示出)的一个子部分。此外,有效载荷542可以是更大的有效载荷(未示出)的一部分并且该部分可以在更大的有效载荷的开始、结束、或任何位置,从而产生与该示例实施例中的那些偏移不同的偏移。It should be understood that regular expression pattern 502, NFA graph 504, payload 542, segments 522a-d, and offsets 520a-d represent examples for illustrative purposes, and that the systems, methods, and corresponding apparatus disclosed herein can be applied to any suitable regular expression pattern, NFA graph, payload, segment, and offset. Further, it should be understood that NFA graph 504 can be a sub-portion of a larger NFA graph (not shown). Additionally, payload 542 can be a portion of a larger payload (not shown) and the portion can be at the beginning, end, or any other position of the larger payload, thereby resulting in offsets that differ from those in the example embodiment.
在该示例实施例中，NFA图形504是被配置成用于使正则表达式模式502与输入流进行匹配的每模式NFA图形。例如，NFA图形504可以包括编译器306所生成的多个节点的图形，如节点N0 506、N1 508、N2 510、N3 512、N4 514、和N5 515。节点N0 506可以表示模式502的起始节点，并且节点N5 515可以表示模式502的标记节点。标记节点N5 515可以与反映与输入流进行匹配的模式502的最终（即，整个或完整）匹配的指示符（未示出）相关联。如此，行走器320可以基于遍历标记节点N5 515和检测该指示符来确定模式502在输入流中匹配。指示符可以是与标记节点相关联的元数据的（未示出）标记或字段设置或任何其他合适的指示符。In this example embodiment, the NFA graph 504 is a per-pattern NFA graph configured to match the regular expression pattern 502 with the input stream. For example, the NFA graph 504 may include a graph of multiple nodes generated by the compiler 306, such as nodes N0 506, N1 508, N2 510, N3 512, N4 514, and N5 515. Node N0 506 may represent the starting node of the pattern 502, and node N5 515 may represent the marked node of the pattern 502. The marked node N5 515 may be associated with an indicator (not shown) reflecting a final (i.e., entire or complete) match of the pattern 502 with the input stream. As such, the walker 320 may determine that the pattern 502 matches in the input stream based on traversing the marked node N5 515 and detecting the indicator. The indicator may be a flag or field setting (not shown) of metadata associated with the marked node or any other suitable indicator.
根据在此披露的实施例，行走器320可以使有效载荷542的段522a-d中的一个段一次走过NFA图形504以将正则表达式模式502与输入流匹配。可以基于偏移518中的给定段的为有效载荷542内的当前偏移的对应偏移来确定用于行走给定节点的段516中的给定段。根据在此披露的实施例，行走器320可以对当前偏移进行以下更新：使当前偏移增量或减量。例如，行走器320可以使NFA图形504在正向或逆向方向上行走，并且因此，可以通过分别使偏移增量或减量在正向543或逆向546方向上走过来自有效载荷542的段。According to embodiments disclosed herein, the walker 320 may walk the NFA graph 504 one segment at a time through the segments 522a-d of the payload 542 to match the regular expression pattern 502 to the input stream. A given segment of the segments 516 used for walking a given node may be determined based on its corresponding offset of the offsets 518 being the current offset within the payload 542. According to embodiments disclosed herein, the walker 320 may update the current offset by incrementing or decrementing the current offset. For example, the walker 320 may walk the NFA graph 504 in a forward or reverse direction and, thus, may walk segments from the payload 542 in the forward 543 or reverse 546 direction by incrementing or decrementing the offset, respectively.
节点N0 506、N2 510、N3 512、以及N4 514可以被配置成用于使对应的元素与有效载荷542的给定段匹配，而节点N1 508和N5 515可以是指示没有匹配功能性的节点类型的节点，并且因此，将不从有效载荷542进行处理。在该示例实施例中，节点N1 508为向行走器320呈现多个转换路径选项的分离节点。例如，行走分离节点N1 508呈现ε路径530a和530b。根据在此披露的实施例，行走器320可以基于与编译器306的相互协议中的隐含设置来选择该多条路径530a和530b中的给定路径。例如，编译器306可以基于行走器320沿着确定性路径的隐含理解、例如根据行走器320基于行走分离节点508而选择上部ε路径530a的隐含理解来生成NFA图形504。根据在此披露的实施例，可以将上部ε路径530a选择成上部ε路径530a表示懒惰路径。该懒惰路径可以是表示最短可能的元素匹配的路径。Nodes N0 506, N2 510, N3 512, and N4 514 may be configured to match corresponding elements to given segments of the payload 542, whereas nodes N1 508 and N5 515 may be nodes of a node type indicating no match functionality and, therefore, no segments from the payload 542 are processed at those nodes. In this example embodiment, node N1 508 is a split node that presents multiple transition path options to the walker 320. For example, walking the split node N1 508 presents the ε-paths 530a and 530b. According to embodiments disclosed herein, the walker 320 may select a given path of the multiple paths 530a and 530b based on an implicit setting in a mutual agreement with the compiler 306. For example, the compiler 306 may generate the NFA graph 504 based on an implicit understanding that the walker 320 follows a deterministic path, for example, according to an implicit understanding that the walker 320 selects the upper ε-path 530a based on walking the split node 508. According to embodiments disclosed herein, the upper ε-path 530a may be selected such that the upper ε-path 530a represents a lazy path. The lazy path may be a path representing the shortest possible element match.
根据在此披露的实施例,分离节点508可以与呈现该多个路径选项的分离节点元数据(未示出)相关联。例如,在该示例实施例中,该分离节点元数据可以或者直接或者间接地指示多个下一个节点,如节点N2 510和N3 512。如果直接指示该多个下一个节点,则该元数据可以包括到下一个节点N2 510和N3 512的绝对地址或指示符。如果间接地指示该多个下一个节点,则该元数据可以包括可以用于解析下一个节点N2 510和N3 512的绝对地址或下一个节点N2 510和N3 512的指示符的索引或偏移。可替代地,可以使用用于直接或间接地指示该多个下一个节点的下一个节点地址的其他合适的形式。According to embodiments disclosed herein, the split node 508 may be associated with split node metadata (not shown) that presents the multiple path options. For example, in this example embodiment, the split node metadata may indicate multiple next nodes, such as nodes N2 510 and N3 512, either directly or indirectly. If the multiple next nodes are indicated directly, the metadata may include absolute addresses or indicators to the next nodes N2 510 and N3 512. If the multiple next nodes are indicated indirectly, the metadata may include an index or offset that can be used to resolve the absolute addresses of the next nodes N2 510 and N3 512 or indicators of the next nodes N2 510 and N3 512. Alternatively, other suitable forms of next node addresses that directly or indirectly indicate the multiple next nodes may be used.
该隐含理解可以包括将行走器320配置成用于基于分离节点元数据内的具体条目位置中所包括的节点元数据来选择多个下一个节点中的下一个给定节点。编译器306可以被配置成用于生成在所指定的条目位置处包括下一个给定节点的指示的分离节点元数据。如此，生成NFA图形504的编译器306可以使用行走器320将在分离节点N1 508处选择给定路径（如上部ε路径530a）的隐含理解。The implicit understanding may include configuring the walker 320 to select a given next node of the multiple next nodes based on node metadata included at a specific entry position within the split node metadata. The compiler 306 may be configured to generate split node metadata that includes an indication of the given next node at the specified entry position. As such, the compiler 306 generating the NFA graph 504 may make use of the implicit understanding that the walker 320 will select a given path (e.g., the upper ε-path 530a) at the split node N1 508.
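The implicit agreement may be modeled as the walker always taking a fixed entry position in the split node's next-node list, at which the compiler has placed the intended (e.g., lazy) path; the metadata structure below is an illustrative assumption:

```python
def select_split_path(split_meta):
    """The compiler places the path it intends the walker to take at a
    fixed entry position (position 0 here); the remaining next nodes
    become unexplored threads to be stored with the current offset."""
    next_nodes = split_meta["next_nodes"]
    chosen = next_nodes[0]        # implicit agreement: fixed entry position
    unexplored = next_nodes[1:]   # pushed as stored threads
    return chosen, unexplored
```

For the split node N1 508 of FIG. 5A, the compiler would place N2 510 (the lazy upper ε-path) at position 0 and N3 512 after it.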
图5B为根据有效载荷542来行走图5A的每模式NFA的处理周期的示例实施例的表538。应理解到，处理周期可以包括一个或多个时钟周期。FIG. 5B is a table 538 of an example embodiment of processing cycles for walking the per-pattern NFA of FIG. 5A according to the payload 542. It should be understood that a processing cycle may include one or more clock cycles.
如表538中所示,处理周期540a-h可以包括根据来自有效载荷542的在当前偏移532处的段来行走当前节点530以确定匹配结果534和基于匹配结果534的行走器动作536。在该示例实施例中,节点N0 506可以具有字符节点类型。例如,节点N0 506可以是被配置成用于与输入流中的字符“h”匹配的字符节点。在该示例实施例中,行走器320可以在处理周期540a中根据当前偏移520a处的段522a(即,“h”)来行走起始节点N0 506。As shown in table 538, processing cycles 540a-h can include walking the current node 530 according to the segment at the current offset 532 from the payload 542 to determine a match result 534 and a walker action 536 based on the match result 534. In this example embodiment, node N0 506 can have a character node type. For example, node N0 506 can be a character node configured to match the character "h" in the input stream. In this example embodiment, the walker 320 can walk the starting node N0 506 according to the segment 522a (i.e., "h") at the current offset 520a in processing cycle 540a.
当段522a与起始节点N0 506处的“h”匹配时，行走器320可以确定匹配结果534为肯定匹配结果。如编译器306通过与起始节点N0 506相关联的元数据（未示出）所指定的，行走器320可以在正向方向上行走并提取与节点N0 506相关联的元数据所指示的下一个节点并且可以使当前偏移从520a（即，“0”）增量到520b（即，“1”）。在该示例实施例中，节点N0 506所指示的下一个节点为分离节点N1 508。如此，行走器320在处理周期540a中采取动作536，该动作包括在有效载荷542中将当前偏移更新成“1”并转换至分离节点N1 508。转换可以包括提取（在此也被称为加载）分离节点N1 508。When the segment 522a matches the "h" at the start node N0 506, the walker 320 may determine the match result 534 to be a positive match result. As specified by the compiler 306 through the metadata (not shown) associated with the start node N0 506, the walker 320 may walk in the forward direction, fetch the next node indicated by the metadata associated with node N0 506, and increment the current offset from 520a (i.e., "0") to 520b (i.e., "1"). In this example embodiment, the next node indicated by node N0 506 is the split node N1 508. As such, the walker 320 takes the action 536 in processing cycle 540a, which includes updating the current offset to "1" in the payload 542 and transitioning to the split node N1 508. The transition may include fetching (also referred to herein as loading) the split node N1 508.
当分离节点N1 508呈现多个转换路径选项时，如ε路径530a和530b，处理周期540b中的动作536可以包括选择上部ε路径530a并提取与有效载荷542无关的节点N2 510而不从有效载荷542消耗（即，处理）。由于分离节点N1 508没有执行匹配函数，当前偏移/段532没有改变，并且因此，在处理周期540b中没有消耗（即，处理）有效载荷。Because the split node N1 508 presents multiple transition path options, such as the ε-paths 530a and 530b, the action 536 in processing cycle 540b may include selecting the upper ε-path 530a and fetching node N2 510 independently of the payload 542, without consuming (i.e., processing) from the payload 542. Because the split node N1 508 performs no match function, the current offset/segment 532 is unchanged and, therefore, no payload is consumed (i.e., processed) in processing cycle 540b.
由于分离节点N1 508呈现多个路径选项,动作536可以包括存储未探究的上下文,如通过存储节点N3 512的间接或直接标识符和当前偏移520b(即,“1”)。所选择的转换路径在此可以被称为当前或活动线程,并且所存储的每个未被遍历的转换路径在此可以被称为存储线程。有效载荷中的相应的节点标识符和偏移可以标识每个线程。如此,未探究的上下文可以标识未探究的线程(即,路径)。Because split node N1 508 presents multiple path options, action 536 may include storing the unexplored context, such as by storing the indirect or direct identifier of node N3 512 and the current offset 520b (i.e., "1"). The selected transition path may be referred to herein as the current or active thread, and each stored transition path that has not been traversed may be referred to herein as a stored thread. The corresponding node identifier and offset in the payload may identify each thread. In this manner, the unexplored context may identify the unexplored thread (i.e., path).
在沿着所选择的部分匹配路径出现否定匹配结果的情况下,例如,如果沿着从节点N2 510延伸的路径在节点N2 510或多个节点处确定否定匹配结果,则存储未探究的上下文可以使行走器320能够记住返回至节点N3 512以根据有效载荷542中的偏移520b“1”处的段行走节点N3 512。根据在此披露的实施例,在沿着所选择的转换路径标识模式502的最终匹配的情况下,可以用丢弃未探究的处理(DUP)指示符来标记未探究的上下文,该指示符向行走器320指示是否丢弃或处理未探究的上下文。In the event of a negative match result along the selected partially matched path, for example, if a negative match result is determined at node N2 510 or multiple nodes along the path extending from node N2 510, storing the unexplored context may enable the walker 320 to remember to return to node N3 512 to walk node N3 512 according to the segment at offset 520b "1" in the payload 542. According to embodiments disclosed herein, in the event that a final match for the pattern 502 is identified along the selected transition path, the unexplored context may be marked with a discard unexplored processing (DUP) indicator that indicates to the walker 320 whether to discard or process the unexplored context.
例如，基于到达指示输入流中的模式502的最终（即，完整或整个）匹配的标记节点N5 515，行走器320可以利用DUP指示符来确定是否通过在偏移520b处根据段“x”行走节点N3 512来处理未探究的上下文以便确定NFA图形504的对模式502进行匹配的另一条路径，或是否丢弃未探究的上下文。用DUP指示符标记未探究的上下文可以包括以任何合适的方式标记未探究的上下文，如通过将与未探究的上下文相关联的位或字段设置为真，以表示希望处理堆栈条目，或设置为假，以表示希望丢弃该堆栈条目。For example, upon reaching the marked node N5 515 indicating a final (i.e., complete or entire) match of the pattern 502 in the input stream, the walker 320 may utilize the DUP indicator to determine whether to process the unexplored context, by walking node N3 512 according to the segment "x" at offset 520b in order to determine another path of the NFA graph 504 that matches the pattern 502, or whether to discard the unexplored context. Marking the unexplored context with the DUP indicator may include marking the unexplored context in any suitable manner, such as by setting a bit or field associated with the unexplored context to true, to indicate that processing the stack entry is desired, or to false, to indicate that discarding the stack entry is desired.
编译器306可以确定是否遍历存储线程。例如,编译器306可以通过配置每个节点的相应元数据中的设置来控制是否设置DUP指示符。可替代地,编译器306可以对与有限自动机相关联的全局元数据中所包括的、对要遍历所有存储的线程进行指定的全局设置进行配置,从而能够标识所有可能的匹配。Compiler 306 can determine whether to traverse the storage thread. For example, compiler 306 can control whether to set the DUP indicator by configuring a setting in the corresponding metadata of each node. Alternatively, compiler 306 can configure a global setting included in the global metadata associated with the finite automaton that specifies a thread to traverse all storage, thereby being able to identify all possible matches.
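The DUP handling on a final match may be sketched as follows, with a context modeled as a dict carrying a `dup` key (an assumption); a per-node or global compiler setting would determine how `dup` was set when the thread was stored:

```python
def on_final_match(run_stack, match_results, node_addr, offset):
    """On reaching a marked node, record the match result, then keep only
    the stored threads whose DUP indicator says to process (not discard)
    them; dup=True marks a thread to be discarded after a final match."""
    match_results.append({"node": node_addr, "end_offset": offset})
    return [ctx for ctx in run_stack if not ctx["dup"]]
```

Setting every thread's `dup` to False corresponds to the global setting that traverses all stored threads so that all possible matches are identified.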
在该示例实施例中，ε转换路径530a的选择会引起在节点N2 510处或当前线程的后续节点（如，节点N4 514）处检测到匹配失败。如此，如果检测到匹配失败，则可以遍历针对ε转换路径530b的存储线程。可替代地，如果由编译器306指定，则不管遍历ε转换路径530b是否引起检测到匹配失败，都可以遍历ε转换路径530b。In this example embodiment, selection of the ε transition path 530a results in a match failure being detected at node N2 510 or at a subsequent node of the current thread (e.g., node N4 514). Thus, if a match failure is detected, the stored thread for the ε transition path 530b may be traversed. Alternatively, if specified by the compiler 306, the ε transition path 530b may be traversed regardless of whether traversing the ε transition path 530b results in a match failure being detected.
存储未遍历的转换路径可以包括通过与条目中的当前偏移520b的指示相关联地存储下一个节点N3 512的标识符来将该条目存储在堆栈上，如图4A的运行堆栈460。下一个节点N3 512的标识符可以是值、指针、或下一个节点的任何其他合适的指示符。偏移值可以是数字值、指针、或对有效载荷542内的段516的位置进行标识的任何其他合适的值。Storing the untraversed transition path may include storing an entry on a stack, such as the run stack 460 of FIG. 4A, by storing an identifier of the next node N3 512 in the entry in association with an indication of the current offset 520b. The identifier of the next node N3 512 may be a value, a pointer, or any other suitable indicator of the next node. The offset value may be a numeric value, a pointer, or any other suitable value that identifies the location of a segment 516 within the payload 542.
根据该示例实施例,基于选择上部路径(即,ε转换路径530a),在处理周期540c中,行走器320可以提取节点N2 510并且试图将当前偏移520b(即,“1”)处的段522b(即,“x”)与节点N2 510的元素“a”进行匹配。由于“x”与节点N2 510处的元素“a”不匹配,处理周期540c中的动作536可以包括从运行堆栈460弹出条目。所弹出的条目544b可以是最近推送的条目,如在该示例实施例中指示节点N3 512和偏移520b(即,“1”)的存储条目544a。According to this example embodiment, based on selecting the upper path (i.e., ε-transition path 530a), in processing cycle 540c, walker 320 may extract node N2 510 and attempt to match segment 522b (i.e., "x") at current offset 520b (i.e., "1") with element "a" of node N2 510. Since "x" does not match element "a" at node N2 510, action 536 in processing cycle 540c may include popping an entry from run stack 460. The popped entry 544b may be the most recently pushed entry, such as, in this example embodiment, stored entry 544a indicating node N3 512 and offset 520b (i.e., "1").
行走器320可以根据位于有效载荷542中的偏移520b处的段“x”转换和行走节点N3 512。如此，处理周期540d中的匹配结果534为肯定的。处理周期540d中的动作536可以包括将当前偏移更新成偏移520c并转换回可以是节点N3 512指示的下一个节点的分离节点N1 508。The walker 320 may then transition to and walk node N3 512 according to the segment "x" located at offset 520b in the payload 542. As shown for processing cycle 540d, the match result 534 is positive. The action 536 in processing cycle 540d may include updating the current offset to offset 520c and transitioning back to the split node N1 508, which may be the next node indicated by node N3 512.
由于从分离节点508转换的所有圆弧为ε转换，在处理周期540e中，行走器320可以在不更新当前偏移的情况下再次选择该多个路径选项中的一条路径并且不消耗（即，处理）来自有效载荷542的段。在该示例实施例中，行走器320再次选择ε转换路径530a。如此，行走器320通过推送节点N3 512和当前偏移（现在为520c（即，“2”））再次将线程存储在运行堆栈460上。如处理周期540f中所示，行走器320提取节点N2 510并将偏移520c（即，“2”）处的段522c（即，“a”）与节点N2 510的元素“a”匹配。由于“a”在节点N2 510处匹配，行走器320将当前偏移更新成520d（即，“3”）并转换至如编译器306配置的节点N2 510元数据（未示出）所指定的节点N4 514。例如，节点N2 510元数据可以指定通过与给定节点N2 510相关联的下一个节点地址（未示出）从给定节点（如节点N2 510）到下一个节点（如节点N4 514）的转换511。根据在此披露的实施例，下一个节点地址可以被配置成用于标识下一个节点和多个存储器中的给定存储器，如编译器306将下一个节点分布到其上以便存储的超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c。Because all arcs transitioning from the split node 508 are ε transitions, in processing cycle 540e the walker 320 may again select one of the multiple path options without updating the current offset and without consuming (i.e., processing) a segment from the payload 542. In this example embodiment, the walker 320 again selects the ε transition path 530a. As such, the walker 320 again stores a thread on the run stack 460 by pushing node N3 512 and the current offset, now 520c (i.e., "2"). As shown in processing cycle 540f, the walker 320 fetches node N2 510 and matches the segment 522c (i.e., "a") at offset 520c (i.e., "2") with the element "a" of node N2 510. Because "a" matches at node N2 510, the walker 320 updates the current offset to 520d (i.e., "3") and transitions to node N4 514, as specified by the node N2 510 metadata (not shown) configured by the compiler 306. For example, the node N2 510 metadata may specify the transition 511 from a given node (e.g., node N2 510) to a next node (e.g., node N4 514) via a next node address (not shown) associated with the given node N2 510. According to embodiments disclosed herein, the next node address may be configured to identify both the next node and a given memory of multiple memories, such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c, to which the compiler 306 distributed the next node for storage.
如此，在处理周期540g中，行走器320可以提取下一个节点N4 514和偏移520d处的下一个段522d（即，"b"）。由于"b"在节点N4 514处进行匹配，行走器320可以转换至下一个节点N5 515。节点N5 515为一个与表示输入流中的正则表达式模式502的最终（即，完整或整个）匹配的指示符相关联的标记节点。因此，在处理周期540h中，行走器320可以中断沿着当前路径的行走并通过将条目存储在匹配结果缓冲区466中来报告最终匹配。然后，行走器320可以对运行堆栈460进行存储线程检查并且按照相应DUP指示符所指示的那样或者丢弃存储线程或将它们激活。如此，行走器320弹出对节点N3 512和偏移520c（即，"2"）进行标识的条目，并确定是否通过根据偏移520c处的段522c行走节点N3 512来激活存储线程或根据与所弹出的条目相关联的DUP指示符来丢弃存储线程。Thus, in processing cycle 540g, the walker 320 can extract the next node N4 514 and the next segment 522d (i.e., "b") at offset 520d. Since "b" matches at node N4 514, the walker 320 can transition to the next node N5 515. Node N5 515 is a marked node associated with an indicator representing the final (i.e., complete or entire) match of the regular expression pattern 502 in the input stream. Therefore, in processing cycle 540h, the walker 320 can interrupt walking along the current path and report the final match by storing an entry in the match results buffer 466. The walker 320 may then perform a stored-thread check on the run stack 460 and either discard the stored threads or activate them as indicated by the corresponding DUP indicators. Thus, the walker 320 pops the entry identifying node N3 512 and offset 520c (i.e., "2") and determines, according to the DUP indicator associated with the popped entry, whether to activate the stored thread by walking node N3 512 with segment 522c at offset 520c or to discard the stored thread.
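The push/pop thread behavior in these processing cycles can be sketched as a tiny simulator. The sketch below is illustrative only: the dictionary-based graph encoding, the example graph (in the spirit of a pattern like "h.*ab"), and the omission of DUP-indicator bookkeeping are assumptions for exposition, not the engine's actual data structures.

```python
# Minimal sketch of the walker's run-stack behavior (hypothetical encoding):
# a split node pushes the untaken epsilon branch as a stored thread with the
# current offset and consumes no segment; an element node consumes one
# matching segment; a marked node reports a final match.

def walk(nfa, payload, start=0, offset=0):
    """Return True if the sketched NFA matches `payload` from `start`."""
    run_stack = [(start, offset)]           # stored threads: (node id, offset)
    while run_stack:
        node_id, off = run_stack.pop()      # resume a stored thread
        while True:
            node = nfa[node_id]
            if node["type"] == "match":     # marked node: final match found
                return True
            if node["type"] == "split":     # epsilon transitions only
                run_stack.append((node["alt"], off))   # push alternate branch
                node_id = node["next"]
                continue
            # element node: consume one segment ("char" of None matches any)
            if off < len(payload) and (node["char"] is None
                                       or payload[off] == node["char"]):
                node_id, off = node["next"], off + 1
            else:
                break                       # dead path; pop the next thread
    return False

# Graph in the spirit of the example: 'h', then a split looping over any
# segment or trying the tail 'ab' (node numbering here is illustrative).
nfa = {
    0: {"type": "char", "char": "h", "next": 1},
    1: {"type": "split", "next": 2, "alt": 3},
    2: {"type": "char", "char": "a", "next": 4},
    3: {"type": "char", "char": None, "next": 1},   # '.'-style element
    4: {"type": "char", "char": "b", "next": 5},
    5: {"type": "match"},
}
```

A false positive simply exhausts the run stack without reaching the marked node, which is the case the memory-hierarchy optimizations later in this description target.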
由于以上披露的DFA和NFA组合式处理,在此披露的实施例可以能够优化匹配性能。例如,由于NFA可以基于通过DFA处理标识的部分匹配,以上披露的实施例可以减少NFA处理中的误报的数量。进一步地,因为在此披露的实施例包括可以通过DFA处理标识的每规则(即,每模式)NFA,在此披露的实施例进一步优化了匹配性能。Due to the combined DFA and NFA processing disclosed above, the embodiments disclosed herein can optimize matching performance. For example, because the NFA can be based on partial matches identified by DFA processing, the embodiments disclosed above can reduce the number of false positives in NFA processing. Furthermore, because the embodiments disclosed herein include per-rule (i.e., per-pattern) NFAs that can be identified by DFA processing, the embodiments disclosed herein further optimize matching performance.
如以上所披露的,DFA 312为统一DFA,并且每个至少一个NFA 314为每模式NFA。HFA处理器110使有效载荷走过统一DFA 312可以被认为是标记模式的起点(中间匹配)并向可以继续从该标记行走以确定最终匹配的至少一个NFA 314提供起点的第一解析块。例如,基于将输入流的有效载荷的段处理经过统一DFA 312所确定的部分匹配结果,行走器320可以确定需要进一步处理规则集310中的给定数量的规则(即,模式),并且当每个至少一个NFA 314为每模式NFA时,HFA处理器110可以产生可以被转换成给定数量的NFA行走的模式匹配结果。As disclosed above, DFA 312 is a unified DFA, and each of at least one NFA 314 is a per-pattern NFA. HFA processor 110 causes the payload to walk through the unified DFA 312, which can be considered as a first parsing block that marks the starting point of a pattern (intermediate match) and provides a starting point to at least one NFA 314 that can continue to walk from the mark to determine the final match. For example, based on the partial match results determined by processing a segment of the payload of the input stream through the unified DFA 312, walker 320 can determine that a given number of rules (i.e., patterns) in rule set 310 need to be further processed, and when each of at least one NFA 314 is a per-pattern NFA, HFA processor 110 can generate a pattern matching result that can be converted into a given number of NFA walks.
图6为行走器320的环境600的示例实施例的框图600。可以接收602数据包101a的输入流并且其可以包括可以是来自不同流的数据包的数据包616a-f，如一个第一流614a和一个第二流614b。例如，数据包P1 616a、P4 616d、和P6 616f可以是第一流614a中的数据包，而数据包P2 616b、P3 616c、和P5 616e可以属于第二流614b。处理内核603可以是安全装置102的可以被配置成用于执行数据包101a的高层协议处理并且可以被配置成用于将模式匹配方法分流至HFA处理器110和至少一个HNA处理器108的通用处理内核，如以上参照图1A和图1G披露的至少一个CPU内核103。FIG. 6 is a block diagram 600 of an example embodiment of an environment 600 for the walker 320. An input stream of packets 101a may be received 602 and may include packets 616a-f, which may be packets from different streams, such as a first stream 614a and a second stream 614b. For example, packets P1 616a, P4 616d, and P6 616f may be packets in the first stream 614a, while packets P2 616b, P3 616c, and P5 616e may belong to the second stream 614b. A processing core 603 may be a general-purpose processing core of the security device 102, such as the at least one CPU core 103 disclosed above with reference to FIG. 1A and FIG. 1G, that may be configured to perform higher-layer protocol processing of the packets 101a and to offload pattern matching methods to the HFA processor 110 and the at least one HNA processor 108.
可以将数据包101a转发604至HFA处理器110,并且行走器320可以通过使数据包101a的段走过统一DFA(如图3A的统一DFA312)以确定输入流中的正则表达式模式304的部分匹配。行走器320可以被配置成用于转发606部分匹配结果,这些部分匹配结果可以标识数据包101a的段的偏移和每模式NFA(如至少一个NFA 314)的节点,以便由可以基于HFA处理器110的DFA处理的部分匹配结果行走至少一个NFA 314的至少一个HNA处理器108的给定超级集群的给定集群的给定HPU使这些部分匹配前进,因为这些匹配结果可以与数据包101a中的相应数据包一起被转发608至至少一个HNA处理器108。The data packet 101a may be forwarded 604 to the HFA processor 110, and the walker 320 may determine partial matches for the regular expression pattern 304 in the input stream by walking segments of the data packet 101a through a unified DFA (such as the unified DFA 312 of FIG. 3A ). The walker 320 may be configured to forward 606 partial match results, which may identify offsets of the segments of the data packet 101a and nodes of a per-pattern NFA (such as the at least one NFA 314), so that the partial matches may be advanced by a given HPU of a given cluster of a given supercluster of at least one HNA processor 108 that may walk the at least one NFA 314 based on the partial match results processed by the DFA of the HFA processor 110, as these match results may be forwarded 608 to the at least one HNA processor 108 along with corresponding data packets in the data packet 101a.
至少一个HNA处理器108的给定超级集群的给定集群的给定HPU可以能够确定部分匹配618c、618b、和618a，形成与输入流中的正则表达式模式304中的给定正则表达式模式的最终（即，完整）匹配。例如，或者通过经由处理内核603间接地、或者直接从HFA处理器110转发605来将HFA部分匹配结果从HFA处理器110转发606至至少一个HNA处理器108，HFA处理器110部分匹配的每个数据包可以使至少一个HNA处理器108的给定超级集群的给定集群的给定HPU能够使部分匹配提前，因为行走器320可以用来自HFA处理器110的"提示"或起始信息使数据包101a的段走过至少一个NFA 314。A given HPU of a given cluster of a given supercluster of the at least one HNA processor 108 may be able to determine partial matches 618c, 618b, and 618a that form a final (i.e., complete) match to a given one of the regular expression patterns 304 in the input stream. For example, by forwarding 606 the HFA partial match results from the HFA processor 110 to the at least one HNA processor 108, either indirectly via the processing core 603 or directly forwarded 605 from the HFA processor 110, each packet partially matched by the HFA processor 110 may enable a given HPU of a given cluster of a given supercluster of the at least one HNA processor 108 to advance the partial match, because the walker 320 may use the "hint" or starting information from the HFA processor 110 to walk segments of the packet 101a through the at least one NFA 314.
例如,如以上参照图4A所披露的,输入堆栈458可以包括至少一个HNA指令153的至少一项HNA工作,以便由所选择的HPU 425的分配有至少一个HNA指令153的HNA处理内核408来进行处理。至少一个HNA指令153的每个至少一项HNA工作可以属于被HFA处理器110处理的同一给定有效载荷。此类可以基于HFA处理器110进行的数据包“预筛选”的“提示”或起始信息可以包括具有有效载荷段的相应偏移、用于根据如上所披露的每模式NFA行走的NFA起始节点。如此,行走器320可以确定可以从至少一个HNA处理器108被转发至处理内核603的数据包101a的最终匹配结果610,并且然后,在网络中,数据包101a可以与数据包101b一样适当被转发612。For example, as disclosed above with reference to FIG. 4A , the input stack 458 may include at least one HNA task of at least one HNA instruction 153 for processing by the HNA processing core 408 of the selected HPU 425 assigned the at least one HNA instruction 153. Each at least one HNA task of the at least one HNA instruction 153 may pertain to the same given payload being processed by the HFA processor 110. Such "hints" or starting information, which may be based on "pre-screening" of packets performed by the HFA processor 110, may include an NFA starting node with a corresponding offset of the payload segment for walking the per-pattern NFA as disclosed above. In this manner, the walker 320 may determine a final matching result 610 for the packet 101a that may be forwarded from the at least one HNA processor 108 to the processing core 603, and the packet 101a may then be appropriately forwarded 612 in the network similarly to the packet 101b.
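The hand-off described above, in which HFA partial-match results become units of HNA work on the input stack, can be sketched as follows. The function names, the prefix-based pre-screen, and the dictionary work-item format are hypothetical stand-ins for the HFA results and the input stack 458, used only to show the shape of the flow.

```python
# Hypothetical sketch of the "pre-screen then hand off" flow: a DFA-style
# pass records partial matches as (pattern id, offset) hints, and each hint
# becomes one unit of NFA work (start node + payload offset), mirroring how
# HFA results populate the HNA input stack.

def dfa_prescreen(payload, prefixes):
    """Return (pattern_id, end_offset) for each pattern prefix found."""
    hints = []
    for pid, prefix in enumerate(prefixes):
        pos = payload.find(prefix)
        if pos != -1:
            hints.append((pid, pos + len(prefix)))  # offset after the prefix
    return hints

def build_input_stack(hints, nfa_start_nodes):
    """One work item per hint: which per-pattern NFA to walk, and from where."""
    return [{"start_node": nfa_start_nodes[pid], "offset": off}
            for pid, off in hints]

hints = dfa_prescreen("xxhabyy", ["ha", "cd"])      # only pattern 0 pre-matches
work = build_input_stack(hints, {0: "N1", 1: "M0"})
```

Only patterns whose prefix actually appears in the payload generate NFA work, which is how the pre-screen reduces false-positive NFA walks.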
除了可以减少NFA处理的误报数量的HFA处理器110进行的此类数据包预筛选以外,在此披露的实施例可以通过基于节点局部性将每个每模式NFA的节点分布到存储器层次中的多个存储器来进一步优化匹配性能。由于每个NFA可以是每模式NFA,在此披露的实施例可以基于以下理解来有利地将每个每模式NFA的节点分布到层次中的多个存储器:规则(即,模式)越长,则越不太可能访问(即,行走或遍历)从规则(即,模式)的末端处的部分生成的节点。通过将每个每模式NFA的早前的节点存储在相对更快(即,性能更高)的存储器内,在此披露的实施例可以进一步优化匹配性能。应理解到,因为此类节点分布可以基于存储器映射的层级,可以基于所映射的层级来有利地分布节点,从而能够实现对有待利用的匹配性能进行优化的任何合适的分布。In addition to such packet pre-screening by the HFA processor 110, which can reduce the number of false positives processed by the NFA, the embodiments disclosed herein can further optimize matching performance by distributing the nodes of each per-pattern NFA to multiple memories in a memory hierarchy based on node locality. Since each NFA can be a per-pattern NFA, the embodiments disclosed herein can advantageously distribute the nodes of each per-pattern NFA to multiple memories in a hierarchy based on the understanding that the longer the rule (i.e., pattern), the less likely it is to access (i.e., walk or traverse) nodes generated from the portion at the end of the rule (i.e., pattern). By storing the earlier nodes of each per-pattern NFA in relatively faster (i.e., higher performance) memory, the embodiments disclosed herein can further optimize matching performance. It should be understood that because such node distribution can be based on a hierarchy of memory maps, the nodes can be advantageously distributed based on the mapped hierarchy, thereby enabling any suitable distribution that optimizes the matching performance to be utilized.
如以上所披露的，至少一个NFA 314（如图5A的每模式NFA 504）可以存储在至少一个存储器内，如超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c。根据在此披露的实施例，可以基于跨可以包括多个图形存储器（如超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c）的至少一个存储器来分布每模式NFA 504的节点的智能编译器306来优化行走器320的匹配性能，这些图形存储器可以在存储器层次中。超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c可以是每个针对每个每模式NFA预加载的静态存储器，以便进行更快速的处理。基于该多个图形存储器的不同访问时间，应用性能可以达到20Gbps以上的搜索速率。As disclosed above, the at least one NFA 314 (such as the per-pattern NFA 504 of FIG. 5A) can be stored in at least one memory, such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c. According to embodiments disclosed herein, the matching performance of the walker 320 can be optimized based on the intelligent compiler 306 distributing the nodes of the per-pattern NFA 504 across at least one memory, which can include multiple graph memories (such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c) that can be arranged in a memory hierarchy. The supercluster graph memory 156a, the HNA on-chip graph memory 156b, and the HNA off-chip graph memory 156c can each be static memory preloaded for each per-pattern NFA, allowing for faster processing. Based on the different access times of the multiple graph memories, application performance can reach search rates above 20 Gbps.
可以基于将如图5A的每模式NFA 504的部分509的节点N0506、N1 508、N2 510、和N3 512的连续节点存储在性能更快的存储器中来优化行走器320的匹配性能,该性能更快的存储器相对于可以被映射到存储器层次中的较低层级的、存储连续节点N4 514和N5 515的另一个存储器被映射到更高的层级。由于NFA 504为从单个模式(如模式502)生成的每模式NFA,NFA 504与从其他模式生成的其他NFA分开,并且因此,在此披露的实施例可以基于每模式NFA的节点的对于统一NFA的节点而言不存在的所识别的局部性。The matching performance of the walker 320 can be optimized based on storing consecutive nodes, such as nodes N0 506, N1 508, N2 510, and N3 512 of portion 509 of the per-pattern NFA 504 of FIG. 5A , in a faster-performing memory that is mapped to a higher level relative to another memory that stores consecutive nodes N4 514 and N5 515, which can be mapped to a lower level in the memory hierarchy. Because NFA 504 is a per-pattern NFA generated from a single pattern (such as pattern 502), NFA 504 is separate from other NFAs generated from other patterns, and therefore, embodiments disclosed herein can be based on the identified locality of nodes of the per-pattern NFA that is not present for nodes of the unified NFA.
在此披露的实施例可以基于以下理解:每模式NFA图形(如每模式NFA图形504)的早前的节点(如节点N0 506、N1 508、N2 510和N3 512)可以比节点N4 514和N5 515具有被遍历的更高可能性,因为节点N4 514和N5 515被定位成朝向规则(即,模式)502的末端,并且因此,需要匹配有效载荷的更多部分以便被行走(即,遍历)。如此,每模式NFA(如NFA504)或任何其他合适的每模式NFA图形的早前节点可以被认为是由于误报而在比“低触摸(low touch)”节点更高频率基础上访问的“高触摸(high touch)”节点,仅在出现模式的完整匹配的情况下,才更有可能访问“低触摸”节点。Embodiments disclosed herein can be based on the understanding that earlier nodes (e.g., nodes N0 506, N1 508, N2 510, and N3 512) of a per-pattern NFA graph (e.g., per-pattern NFA graph 504) can have a higher probability of being traversed than nodes N4 514 and N5 515 because nodes N4 514 and N5 515 are located towards the end of rule (i.e., pattern) 502 and, therefore, require matching more portions of the payload in order to be walked (i.e., traversed). As such, earlier nodes of a per-pattern NFA (e.g., NFA 504) or any other suitable per-pattern NFA graph can be considered "high touch" nodes that are visited more frequently due to false positives than "low touch" nodes, which are more likely to be visited only in the event of a complete match of the pattern.
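The "high touch" versus "low touch" intuition can be made concrete with a back-of-envelope model: if each element matches a random payload segment with some probability p, a node at depth k in the per-pattern NFA is walked with probability roughly p raised to the k-th power, so early nodes absorb nearly all accesses. The value p = 1/256 below (one byte value out of 256) is an illustrative assumption, not a figure from this description.

```python
# Back-of-envelope model for the high-touch/low-touch claim: the deeper a
# node sits in the pattern, the more payload must already have matched to
# reach it, so its visit probability decays geometrically with depth.

def reach_probability(depth, p=1 / 256):
    """Probability that the node `depth` elements into a pattern is walked."""
    return p ** depth

early = reach_probability(1)    # early node: hit by many false positives
late = reach_probability(4)     # late node: essentially only on real matches
```

Under this model the early nodes dominate the access count by orders of magnitude, which is why placing them in the fastest memory pays off.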
根据在此披露的实施例,基于对每个每模式NFA中的哪些节点被认为是“高触摸”节点和哪些节点被认为是“低触摸”节点的理解,编译器306可以将每个每模式NFA中的节点分布到层次中的多个存储器。此类理解可以用于通过将每个每模式NFA的节点分布到存储器层次中的多个存储器来“预高速缓存”(即,静态地存储)这些节点,从而能够提高匹配性能。例如,可以基于以下理解将“高触摸”节点分布到更快的存储器:由于其在每模式NFA内的局部性,可以更频繁地访问(即,行走或遍历)这些“高触摸”节点。According to embodiments disclosed herein, compiler 306 can distribute the nodes in each per-pattern NFA to multiple memories in a hierarchy based on an understanding of which nodes in each per-pattern NFA are considered "high touch" nodes and which nodes are considered "low touch" nodes. Such understanding can be used to "pre-cache" (i.e., statically store) the nodes of each per-pattern NFA by distributing them to multiple memories in a memory hierarchy, thereby improving matching performance. For example, "high touch" nodes can be distributed to faster memories based on the understanding that these "high touch" nodes can be accessed (i.e., walked or traversed) more frequently due to their locality within the per-pattern NFA.
通常,统一NFA的基于正则表达式模式集合生成的正则表达式访问模式可以是随机的,因为此类模式可以基于具体有效载荷。因此,正则表达式访问模式的历史不能用于预测进一步的正则表达式访问模式。例如,高速缓存统一NFA的最近遍历的节点可能不会向行走器提供性能益处,因为统一NFA内的所访问的下一个节点可能不是所高速缓存的节点。Typically, the regular expression access patterns generated by a unified NFA based on a set of regular expression patterns can be random because such patterns can be based on a specific payload. Therefore, the history of regular expression access patterns cannot be used to predict future regular expression access patterns. For example, caching the most recently traversed nodes of a unified NFA may not provide a performance benefit to the walker because the next node visited within the unified NFA may not be the cached node.
图7A为编译器306的环境700的实施例的框图。如以上所披露的，编译器306在此可以被称为可以被配置成用于通过对来自规则集310的可能最适用于DFA或NFA处理的部分进行标识来将规则集310编译成二值图像112的智能编译器。因此，二值图像112可以包括至少两个部分，其中，一个用于DFA处理的第一部分和一个用于NFA处理的第二部分，如以上参照图3A所披露的统一DFA 312和至少一个NFA 314。根据在此披露的实施例，至少一个HNA处理器108可以操作性地耦合至多个存储器，这些存储器可以包括多个图形存储器，像如上所披露的超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c。根据在此披露的实施例，编译器306可以被配置成用于确定图形存储器中的统一DFA 312和至少一个NFA 314的节点的位置，如超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c。FIG. 7A is a block diagram of an embodiment of an environment 700 for the compiler 306. As disclosed above, the compiler 306 may be referred to herein as an intelligent compiler that may be configured to compile the rule set 310 into the binary image 112 by identifying portions of the rule set 310 that may be best suited for DFA or NFA processing. Thus, the binary image 112 may include at least two portions: a first portion for DFA processing and a second portion for NFA processing, such as the unified DFA 312 and the at least one NFA 314 disclosed above with reference to FIG. 3A. According to embodiments disclosed herein, the at least one HNA processor 108 may be operatively coupled to a plurality of memories, which may include a plurality of graph memories, such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c disclosed above. According to embodiments disclosed herein, the compiler 306 may be configured to determine the locations of the nodes of the unified DFA 312 and the at least one NFA 314 in a graph memory, such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c.
根据在此披露的实施例，统一DFA 312可以静态地存储在DFA图形存储器中的给定存储器中，而至少一个NFA 314可能已经分布了节点并且将其跨这些图形存储器静态地存储，如超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c，因为编译器306可以确定具体NFA的分布目标以便存储在具体存储器内，以便优化行走器匹配性能。根据在此披露的实施例，如超级集群图形存储器156a、HNA片上图形存储器156b、或HNA芯片外图形存储器156c的图形存储器可以在存储器层次743中，该存储器层次可以包括多个层级708a-c。该多个层级708a-c可以被映射到可以包括存储器756a-c的该多个图形存储器，这些图形存储器可以分别是超级集群图形存储器156a、HNA片上图形存储器156b、和HNA芯片外图形存储器156c。According to embodiments disclosed herein, the unified DFA 312 may be statically stored in a given memory of the DFA graph memories, while the at least one NFA 314 may have its nodes distributed and statically stored across the graph memories, such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c, because the compiler 306 can determine to which specific memory a specific NFA's nodes are distributed for storage so as to optimize walker matching performance. According to embodiments disclosed herein, graph memories such as the supercluster graph memory 156a, the HNA on-chip graph memory 156b, or the HNA off-chip graph memory 156c may be in a memory hierarchy 743, which may include multiple levels 708a-c. The multiple levels 708a-c may be mapped to the multiple graph memories, which may include memories 756a-c, which may be the supercluster graph memory 156a, the HNA on-chip graph memory 156b, and the HNA off-chip graph memory 156c, respectively.
编译器306可以用任何合适的方式映射层级708a-c，并且可以按递减次序712对这些层级708a-c进行排名，从而使得层级708a可以是排名最高的层级，并且层级708c可以是排名最低的层级。图形存储器756a-c可以包括可以是性能最高的存储器的随机存取存储器（RAM），该存储器可以与片上搜索存储器（OSM）一起共同位于网络服务处理器100上。图形存储器756a-c可以包括可以被包括在至少一个系统存储器151内的HNA芯片外图形存储器156c，该系统存储器可以是外部的并且操作性地耦合至网络服务处理器100。The compiler 306 can map the levels 708a-c in any suitable manner and can rank the levels 708a-c in descending order 712 such that level 708a can be the highest-ranked level and level 708c can be the lowest-ranked level. The graph memories 756a-c can include random access memory (RAM), which can be the highest-performance memory and can be co-located on the network services processor 100 together with on-chip search memory (OSM). The graph memories 756a-c can include the HNA off-chip graph memory 156c, which can be included in the at least one system memory 151, which can be external to and operatively coupled to the network services processor 100.
基于根据存储器的性能（即，读取和写入访问时间）的映射，RAM存储器可以被映射到排名最高的层级708a，OSM可以被映射到排名第二高的层级708b，并且该系统存储器可以被映射到排名最低的层级708c。然而，应理解到，可以用任何合适的方式进行多个层级708a-c与图形存储器756a-c之间的映射。例如，该映射可以基于对与规则集310相关联的应用的了解，可以从该规则集生成被分布到存储器756a-c的节点，从而性能最高的存储器不能被映射到排名最高的层级。进一步地，应理解到，存储器层次743中的层级的数量和所示图形存储器756a-c的数量出于说明性目的并且可以是任何合适的层级和存储器数量。Based on a mapping according to memory performance (i.e., read and write access times), the RAM memory can be mapped to the highest-ranked level 708a, the OSM can be mapped to the second-highest-ranked level 708b, and the system memory can be mapped to the lowest-ranked level 708c. However, it should be understood that the mapping between the multiple levels 708a-c and the graph memories 756a-c can be performed in any suitable manner. For example, the mapping can be based on knowledge of the application associated with the rule set 310, from which the nodes distributed to the memories 756a-c can be generated, such that the highest-performance memory is not mapped to the highest-ranked level. Further, it should be understood that the number of levels in the memory hierarchy 743 and the number of graph memories 756a-c shown are for illustrative purposes and can be any suitable number of levels and memories.
如以上所披露的,通过将从给定模式的早前的部分生成的NFA节点存储在更快的存储器内,智能编译器306可以利用每模式NFA的节点的局部性。进一步地,由于自HFA处理器110的DFA处理确定给定模式的部分匹配以来,给定模式的匹配的概率已经较高,此类实施例组合以优化匹配性能。As disclosed above, by storing NFA nodes generated from earlier portions of a given pattern in faster memory, the intelligent compiler 306 can exploit the locality of nodes in the per-pattern NFA. Further, since the probability of a match for a given pattern is already higher since the DFA processing of the HFA processor 110 determined a partial match for the given pattern, such embodiments combine to optimize matching performance.
例如,如以上所披露的,DFA处理可以用于减少NFA处理发现的误报的数量。由于每个NFA可以是每模式NFA,可以基于多个存储器到存储器层次743的层级的映射跨该多个存储器有利地分布每个每模式NFA的节点。例如,从长度相对较短的模式生成的较小的NFA可以将所有节点分布到一个第一级并存储在被映射到该第一级的第一存储器,而从相对较长的模式生成的较大的NFA可以将节点的第一部分分布到该第一级并且将剩余部分分布在剩余级之间。该第一级可以被映射到性能最高的存储器的排名最高的级。For example, as disclosed above, DFA processing can be used to reduce the number of false positives found by NFA processing. Since each NFA can be a per-pattern NFA, the nodes of each per-pattern NFA can be advantageously distributed across the multiple memories based on the mapping of multiple memories to levels of the memory hierarchy 743. For example, a smaller NFA generated from a pattern of relatively short length can distribute all nodes to a first level and store them in a first memory mapped to the first level, while a larger NFA generated from a pattern of relatively long length can distribute the first portion of the nodes to the first level and the remaining portion among the remaining levels. The first level can be mapped to the highest-ranked level of the highest-performance memory.
如此，每模式NFA的早前的节点可以存储在性能最高的存储器内。由于早前的节点因为误报而可能具有被遍历的更高可能性，在此披露的实施例可以通过对被映射到存储器层次743中的较高级的存储器的访问能够处理大多数误报。根据在此披露的实施例，可以通过使对被映射到排名最高的层级（如存储器层次743中的层级708a）的存储器756a的访问数量比对可以被映射到排名最低的层级708c的存储器756c的访问数量相对较高来优化匹配性能。Thus, the earlier nodes of a per-pattern NFA can be stored in the highest-performance memory. Since the earlier nodes may have a higher probability of being traversed due to false positives, the embodiments disclosed herein can handle most false positives through accesses to memory mapped to higher levels in the memory hierarchy 743. According to the embodiments disclosed herein, matching performance can be optimized by making the number of accesses to memory 756a, mapped to the highest-ranked level (e.g., level 708a in the memory hierarchy 743), relatively high compared to the number of accesses to memory 756c, which can be mapped to the lowest-ranked level 708c.
存储器756a可以是能够支持例如13亿项事务每秒的性能最高的存储器，而存储器756b可以具有能够支持1.5亿项事务每秒的较低性能，并且存储器756c可以是能够支持1200万项事务每秒的性能最低的存储器。进一步地，根据在此披露的实施例，此类被映射到排名较高的层级的性能较高的存储器的存储量大小可以比被映射到排名最低的层级708c的性能较低的存储器（如存储器756c）相对较小，该存储器相比可以是相对大的存储器。例如，存储器756c可以是包括在至少一个系统存储器151内的HNA芯片外图形存储器156c，该至少一个系统存储器可以是外部的并且提供受到物理附接的存储器的量限制的相对大量的存储容量。Memory 756a may be the highest-performance memory, capable of supporting, for example, 1.3 billion transactions per second, while memory 756b may have lower performance, capable of supporting 150 million transactions per second, and memory 756c may be the lowest-performance memory, capable of supporting 12 million transactions per second. Further, according to embodiments disclosed herein, such higher-performance memories mapped to higher-ranked levels may have a relatively smaller storage size than a lower-performance memory (such as memory 756c) mapped to the lowest-ranked level 708c, which may be a relatively large memory. For example, memory 756c may be the HNA off-chip graph memory 156c included within the at least one system memory 151, which may be external and provide a relatively large amount of storage capacity limited by the amount of physically attached memory.
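Given these illustrative transaction rates, the effective node-access rate is the inverse of the average time per access, weighted by the fraction of accesses each memory serves. The access fractions below are hypothetical; the sketch only shows that overall throughput stays close to the fast memory's rate when the fast memory absorbs most accesses.

```python
# Effective node-access rate for a three-level hierarchy, using the
# illustrative rates from the text (1.3e9, 1.5e8, and 1.2e7 transactions/s).
# The access fractions are assumed, not taken from this description.

def effective_rate(rates, fractions):
    """Inverse of the access-weighted average time per node access."""
    return 1.0 / sum(f / r for r, f in zip(rates, fractions))

rates = [1.3e9, 1.5e8, 1.2e7]                     # transactions/s per level
hot = effective_rate(rates, [0.95, 0.04, 0.01])   # early nodes in fast memory
cold = effective_rate(rates, [0.33, 0.33, 0.34])  # nodes spread uniformly
```

In this model, concentrating accesses in the fast memory yields more than an order of magnitude higher effective rate than a uniform spread, which is the quantitative motivation for the node distribution described here.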
根据在此披露的实施例,每模式NFA存储分配设置710a-c可以被配置成用于层级708a-c。每模式NFA存储分配设置710a-c可以表示用于从每个每模式NFA分布到层级708a-c的对应层级的目标数量的唯一节点以便存储在被映射到对应层级的给定存储器内。在针对规则集310中的一个或多个模式中的每个模式生成每模式NFA的情况下,编译器306可以被配置成用于以一种使被映射到层级708a-c的存储器756a-c能够提供足够的存储容量的方式来确定每模式NFA存储分配设置710a-c。According to embodiments disclosed herein, per-pattern NFA storage allocation settings 710a-c can be configured for levels 708a-c. Per-pattern NFA storage allocation settings 710a-c can represent a target number of unique nodes to be distributed from each per-pattern NFA to the corresponding level of the levels 708a-c for storage within a given memory mapped to the corresponding level. When generating a per-pattern NFA for each of one or more patterns in the rule set 310, the compiler 306 can be configured to determine the per-pattern NFA storage allocation settings 710a-c in a manner that enables the memory 756a-c mapped to the levels 708a-c to provide sufficient storage capacity.
每模式NFA存储分配设置710a-c可以表示每个每模式NFA的节点集合中的用于分布到对应层级的目标数量的唯一节点，以便存储到被映射到该对应层级的给定存储器。例如，基于被配置成用于层级708a的每模式NFA存储分配设置710a，编译器306可以分布每模式NFA 714a的对应节点集合702a的第一部分704a和每模式NFA 714b的对应节点集合702b的第二部分704b，以便存储在被映射到层级708a的存储器756a内。The per-pattern NFA storage allocation settings 710a-c can represent a target number of unique nodes in the node set of each per-pattern NFA to be distributed to the corresponding level for storage in a given memory mapped to the corresponding level. For example, based on the per-pattern NFA storage allocation setting 710a configured for level 708a, the compiler 306 can distribute the first portion 704a of the corresponding node set 702a of the per-pattern NFA 714a and the second portion 704b of the corresponding node set 702b of the per-pattern NFA 714b for storage in the memory 756a mapped to level 708a.
基于被配置成用于层级708b的每模式NFA存储分配设置710b，编译器306可以分布每模式NFA 714a的对应节点集合702a的第三部分706a和每模式NFA 714b的对应节点集合702b的第四部分706b，以便存储在被映射到层级708b的存储器756b内。这种分布为目标分布，因为给定对应的节点集合中的节点的数量可能少于目标数量，如所生成的节点少于目标数量、或对应集合内保留用于分布的节点少于目标数量。Based on the per-pattern NFA storage allocation setting 710b configured for level 708b, the compiler 306 may distribute the third portion 706a of the corresponding node set 702a of the per-pattern NFA 714a and the fourth portion 706b of the corresponding node set 702b of the per-pattern NFA 714b for storage within the memory 756b mapped to level 708b. This distribution is a target distribution because a given corresponding node set may contain fewer than the target number of nodes, either because fewer than the target number were generated or because fewer than the target number remain in the corresponding set for distribution.
在该示例实施例中，每模式NFA存储分配设置710c可以被配置成用于存储器层次743的排名最低的级708c并且可以用表示无限数量的方式来指定。在该示例实施例中，被映射到排名最低的层级708c的存储器756c可以是包括在至少一个系统存储器151内的HNA芯片外图形存储器156c，该至少一个系统存储器具有相对大的存储量。如此，编译器306可以将节点分布到该系统存储器，包括对针对每模式NFA 714a-b中的每个生成的每个对应节点集合中的任何剩余未分布的节点进行分布，以便存储在系统存储器756c内。In this example embodiment, the per-pattern NFA storage allocation setting 710c can be configured for the lowest-ranked level 708c of the memory hierarchy 743 and can be specified in a manner that represents an unlimited number. In this example embodiment, the memory 756c mapped to the lowest-ranked level 708c can be the HNA off-chip graph memory 156c included in the at least one system memory 151, which has a relatively large amount of storage. In this way, the compiler 306 can distribute nodes to the system memory, including distributing any remaining undistributed nodes in each corresponding node set generated for each of the per-pattern NFAs 714a-b, for storage within the system memory 756c.
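The allocation scheme described above, per-level target node counts with an effectively unbounded lowest level, can be sketched as follows. The list-based node and allocation representations are assumptions for illustration, not the compiler's actual data layout.

```python
# Sketch of per-pattern node distribution (assumed behavior): each hierarchy
# level has a target node count; earlier ("high touch") nodes fill the faster
# levels first, and the last level's quota is unbounded so every remaining
# node lands in the large lowest-ranked memory.

def distribute(nfa_nodes, allocations):
    """allocations: per-level target node counts; the last entry may be inf."""
    placement, i = [], 0
    for quota in allocations:
        if quota == float("inf"):
            take = len(nfa_nodes) - i          # lowest level: take all remaining
        else:
            take = min(int(quota), len(nfa_nodes) - i)
        placement.append(nfa_nodes[i:i + take])
        i += take
    return placement

short_nfa = ["N0", "N1"]                          # NFA from a short pattern
long_nfa = ["N0", "N1", "N2", "N3", "N4", "N5"]   # NFA from a longer pattern
alloc = [4, float("inf")]                         # 4 nodes fast, rest below
```

A short pattern's NFA fits entirely in the first level while a longer pattern's NFA spills its later nodes downward, matching the behavior described for NFAs 814a-c below.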
应理解到,编译器可以内在地了解层级到存储器映射,并且如此,可以排除特定层级708a-c。例如,编译器306可以对每模式NFA存储分配设置710a-c进行配置并且基于对存储器层次743中的存储器756a-c中的每个存储器的层级映射的内在了解来将这些设置直接映射到存储器756a-c。还应理解到,图7A中所示的每模式NFA、每模式NFA的节点、和分布数量是出于说明性目的并且可以是任何合适的每模式NFA、节点、或分布数量。It should be understood that the compiler can inherently understand the hierarchy-to-memory mapping and, as such, can exclude certain hierarchies 708a-c. For example, compiler 306 can configure per-pattern NFA storage allocation settings 710a-c and map these settings directly to memories 756a-c based on inherent knowledge of the hierarchy mapping for each of memories 756a-c in memory hierarchy 743. It should also be understood that the per-pattern NFAs, nodes per-pattern NFAs, and number of distributions shown in FIG7A are for illustrative purposes and can be any suitable per-pattern NFAs, nodes, or number of distributions.
图7B为HNA处理内核408的示例实施例的框图721，该处理内核操作性地耦合至多个存储器756a-c，这些存储器可以被映射到图7A的存储器层次743中的层级708a-c、和图4A的节点高速缓存451。相对于存储器756b和756c，存储器756a可以是性能最快的存储器。存储器756a可以被映射到存储器层次743中的排名最高的层级708a。相对于也操作性地耦合至HNA处理内核408的其他存储器756a和756b，存储器756c可以是性能最低的存储器。FIG. 7B is a block diagram 721 of an example embodiment of the HNA processing core 408, which is operatively coupled to a plurality of memories 756a-c, which can be mapped to the levels 708a-c in the memory hierarchy 743 of FIG. 7A, and to the node cache 451 of FIG. 4A. Memory 756a can be the fastest-performing memory relative to memories 756b and 756c. Memory 756a can be mapped to the highest-ranked level 708a in the memory hierarchy 743. Memory 756c can be the lowest-performing memory relative to the other memories 756a and 756b also operatively coupled to the HNA processing core 408.
排名最高的存储器756a可以是与HNA处理内核408一起共同位于722芯片上的第一存储器。存储器756b可以是一个排名第二高的存储器,该存储器为与HNA处理内核408一起共同位于722芯片上的第二存储器。相对于操作性地耦合至HNA处理内核408的其他存储器756b和756c,排名最高的存储器756a可以是性能最高的存储器。性能最高的存储器756a可以具有最快的读取和写入访问时间。存储器756c可以是性能最低的存储器,其可以是最大的存储器,如不与HNA处理内核408一起位于芯片上的外部存储器。The highest-ranked memory 756a may be the first memory co-located on chip 722 with the HNA processing core 408. Memory 756b may be the second-highest-ranked memory, which is the second memory co-located on chip 722 with the HNA processing core 408. Relative to the other memories 756b and 756c operatively coupled to the HNA processing core 408, the highest-ranked memory 756a may be the highest-performance memory. The highest-performance memory 756a may have the fastest read and write access times. Memory 756c may be the lowest-performance memory, which may be the largest memory, such as external memory not co-located on chip with the HNA processing core 408.
对应的层次节点事务大小723a-c可以与层级708a-c中的每个相关联。每个对应的层次节点事务大小可以表示从映射到对应层级的给定存储器提取的用于对给定存储器的读取访问的最大节点数量。例如,层次节点事务大小723a可以与最高层级708a相关联。由于存储器756a处于最高层级708a,层次节点事务大小723a可以表示从存储器756a提取的最大节点数量。类似地,由于存储器756b处于第二高层级708b,层次节点事务大小723b可以表示从存储器756b提取的最大节点数量,并且由于存储器756c处于下一个最低层级708c,层次节点事务大小723c可以表示从存储器756c提取的最大节点数量。Corresponding level node transaction sizes 723a-c can be associated with each of the levels 708a-c. Each corresponding level node transaction size can represent the maximum number of nodes that can be fetched from a given memory mapped to the corresponding level for read accesses to the given memory. For example, level node transaction size 723a can be associated with the highest level 708a. Since memory 756a is at the highest level 708a, level node transaction size 723a can represent the maximum number of nodes that can be fetched from memory 756a. Similarly, since memory 756b is at the second highest level 708b, level node transaction size 723b can represent the maximum number of nodes that can be fetched from memory 756b, and since memory 756c is at the next lowest level 708c, level node transaction size 723c can represent the maximum number of nodes that can be fetched from memory 756c.
图8为多个每模式NFA的节点分布的示例实施例的框图800。在该示例实施例中，针对一个或多个模式804中的模式816a生成一个第一NFA 814a，针对一个或多个模式804中的第二模式816b生成一个第二NFA 814b，针对一个或多个模式804中的第三模式816c生成一个第三NFA 814c。FIG. 8 is a block diagram 800 of an example embodiment of a node distribution for multiple per-pattern NFAs. In this example embodiment, a first NFA 814a is generated for pattern 816a of the one or more patterns 804, a second NFA 814b is generated for a second pattern 816b of the one or more patterns 804, and a third NFA 814c is generated for a third pattern 816c of the one or more patterns 804.
第一每模式NFA 814a的节点804a的第一部分被分布到被映射到存储器层次812中的第一存储器856a的层级808a,并且节点806a的第二部分被分布到被映射到第二存储器856b的第二层级808b。在该示例实施例中,层级808a为排名最高的级,并且层级808b为排名最低的层级。第二每模式NFA 814b的节点804b的第三部分被分布到映射到存储器层次812中的第一存储器856a的层级808a,并且节点806b的第四部分被分布到映射到第二存储器856b的第二层级808b。第三每模式NFA 814c的节点804c的第五部分被分布到映射到存储器层次812中的第一存储器856a的层级808a,并且节点806c的第六部分被分布到映射到第二存储器856b的第二层级808b。A first portion of node 804a of first per-pattern NFA 814a is distributed to level 808a, which is mapped to first memory 856a in memory hierarchy 812, and a second portion of node 806a is distributed to second level 808b, which is mapped to second memory 856b. In this example embodiment, level 808a is the highest-ranked level, and level 808b is the lowest-ranked level. A third portion of node 804b of second per-pattern NFA 814b is distributed to level 808a, which is mapped to first memory 856a in memory hierarchy 812, and a fourth portion of node 806b is distributed to second level 808b, which is mapped to second memory 856b. A fifth portion of node 804c of third per-pattern NFA 814c is distributed to level 808a mapped to first memory 856a in memory hierarchy 812, and a sixth portion of node 806c is distributed to second level 808b mapped to second memory 856b.
如图8中所示，第二NFA 814b的被分布用于存储在被映射到层级808a的存储器856a内的第二节点部分804b可以分别小于第一NFA 814a的第一节点部分804a和第三NFA 814c的第五节点部分804c。例如，如果每模式NFA 814b的节点的数量小于层级808a的每NFA存储分配设置(未示出)所表示的唯一节点的目标数量，则情况可以是这样。进一步地，由于层级808b为存储器层次812中的排名最低的层级，层级808b的下一个每模式NFA存储分配设置(未示出)可以非常大，从而在已经分布到比层级808b更高的每个层级后，能够分布所有未分布的节点以便存储在被映射到层级808b的存储器856b内。如此，在该示例实施例中，第二节点部分806a可以包括比第六部分806c更多的节点，因为模式816a可以是比模式816c更长的规则。进一步地，第四节点部分806b可以是空的，因为模式816b可以相对短，具有很少的针对每模式NFA 814b生成的节点，从而引起每模式NFA 814b的所有节点被分布到层级808a以便存储在存储器856a内。As shown in FIG. 8, the second node portion 804b of the second NFA 814b that is distributed for storage within the memory 856a mapped to the level 808a can be smaller than the first node portion 804a of the first NFA 814a and the fifth node portion 804c of the third NFA 814c, respectively. This can be the case, for example, if the number of nodes of per-pattern NFA 814b is smaller than the target number of unique nodes indicated by the per-NFA storage allocation setting (not shown) of the level 808a. Furthermore, since level 808b is the lowest-ranked level in the memory hierarchy 812, the next per-pattern NFA storage allocation setting (not shown) of level 808b can be sufficiently large that, after distribution to each level higher than level 808b, all undistributed nodes can be distributed for storage within the memory 856b mapped to the level 808b. Thus, in this example embodiment, second node portion 806a may include more nodes than sixth portion 806c because pattern 816a may be a longer rule than pattern 816c. Further, fourth node portion 806b may be empty because pattern 816b may be relatively short, having few nodes generated for per-pattern NFA 814b, thereby causing all nodes of per-pattern NFA 814b to be distributed to level 808a for storage within memory 856a.
编译器306可以分布每个每模式NFA的节点作为生成每个每模式NFA的一部分。如以上所披露的,可以通过经由下一个节点地址对第二节点进行标识的第一节点元数据来指定NFA中从一个第一节点到一个第二节点的转换。根据在此披露的实施例,编译器306可以将下一个节点地址配置成包括一个对多个存储器中的已经分布该第二节点以便存储到其中的给定存储器进行指示的部分。Compiler 306 can distribute the nodes of each per-pattern NFA as part of generating each per-pattern NFA. As disclosed above, a transition from a first node to a second node in an NFA can be specified by first-node metadata that identifies the second node via a next-node address. According to embodiments disclosed herein, compiler 306 can configure the next-node address to include a portion indicating a given memory in a plurality of memories to which the second node has been distributed for storage.
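The next-node address encoding described here can be sketched as follows, assuming (hypothetically) a 2-bit memory-select field packed above a 30-bit node index; the actual field widths and function names are assumptions, since the source does not specify them.

```python
# Sketch of a next-node address that embeds a memory-select portion, as the
# compiler is described as doing. Field widths are assumptions: the top 2
# bits select one of up to 4 memories, the low 30 bits index the node.
MEM_SELECT_BITS = 2
NODE_INDEX_BITS = 30

def encode_next_node(memory_id: int, node_index: int) -> int:
    """Pack the memory-select portion and the node index into one address."""
    assert 0 <= memory_id < (1 << MEM_SELECT_BITS)
    assert 0 <= node_index < (1 << NODE_INDEX_BITS)
    return (memory_id << NODE_INDEX_BITS) | node_index

def decode_next_node(address: int):
    """Recover (memory_id, node_index) from a packed next-node address."""
    return address >> NODE_INDEX_BITS, address & ((1 << NODE_INDEX_BITS) - 1)
```

With this layout, the walker can determine from the address alone which of the plurality of memories holds the second node.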
图9为可以在至少一个处理器中实施的方法900的示例实施例的流程图，该处理器操作性地耦合至被映射到安全装置内的存储器层次中的多个层级的存储器，该安全装置操作性地耦合至网络。该方法可以开始(902)并生成至少一个非确定有限自动机(NFA)(904)。可以针对单个正则表达式模式生成每个每模式NFA并且其可以包括对应的节点集合。在该示例实施例中，该方法可以分布每个每模式NFA的对应节点集合中的节点以便存储在基于所映射的层级和被配置用于这些层级的每模式NFA存储分配设置的多个存储器内(906)，并且之后，该方法结束(908)。FIG. 9 is a flow chart of an example embodiment of a method 900 that can be implemented in at least one processor operatively coupled to memories mapped to a plurality of levels of a memory hierarchy within a security device, the security device being operatively coupled to a network. The method can begin (902) and generate at least one non-deterministic finite automaton (NFA) (904). Each per-pattern NFA can be generated for a single regular expression pattern and can include a corresponding set of nodes. In this example embodiment, the method can distribute the nodes in the corresponding set of nodes of each per-pattern NFA for storage within a plurality of memories based on the mapped levels and the per-pattern NFA storage allocation settings configured for these levels (906), and thereafter, the method ends (908).
图10为多个每模式NFA的节点的另一节点分布的示例实施例的框图1000。在该示例实施例中，示出了针对存储在一个第一存储器1056a和一个第二存储器1056b内的节点的分布1004和1006。每个每模式NFA 1014a-c的分布1004和1006可以基于分别被配置成用于层级1008a和1008b的每模式NFA存储分配设置1010a和1010b。在该示例实施例中，层级1008a和1008b被分别映射到第一存储器1056a和第二存储器1056b。FIG. 10 is a block diagram 1000 of another example embodiment of a node distribution for multiple per-pattern NFAs. In this example embodiment, node distributions 1004 and 1006 are shown for nodes stored in a first memory 1056a and a second memory 1056b. The distributions 1004 and 1006 for each per-pattern NFA 1014a-c can be based on per-pattern NFA storage allocation settings 1010a and 1010b configured for levels 1008a and 1008b, respectively. In this example embodiment, levels 1008a and 1008b are mapped to first memory 1056a and second memory 1056b, respectively.
图11为一种用于对至少一个每模式NFA的节点进行分布的方法的示例实施例的流程图1100。根据在此披露的实施例，对所生成的每个每模式NFA的对应节点集合中的节点进行分布可以包括以连续方式对对应节点集合中的节点进行分布，该连续方式包括对应节点集合中的节点的第一分布以便存储在多个存储器中的第一存储器内。该第一存储器可以被映射到层级中的排名最高的层级。基于之前的分布之后该对应节点集合中剩余的至少一个未分布的节点，分布可以包括该对应节点集合中的节点的至少一次第二分布。每次至少一次第二分布可以是为了存储在多个存储器中的给定存储器内。该给定存储器可以被映射到层级中的给定层级，该给定层级对于每次第二分布比排名最高的层级连续更低。FIG. 11 is a flowchart 1100 of an example embodiment of a method for distributing nodes of at least one per-pattern NFA. According to embodiments disclosed herein, distributing the nodes in the corresponding node set of each generated per-pattern NFA may include distributing the nodes in the corresponding node set in a sequential manner that includes a first distribution of the nodes in the corresponding node set for storage in a first memory of a plurality of memories. The first memory may be mapped to a highest-ranked level of the levels. Based on at least one undistributed node remaining in the corresponding node set after a previous distribution, the distributing may include at least one second distribution of the nodes in the corresponding node set. Each of the at least one second distribution may be for storage in a given memory of the plurality of memories. The given memory may be mapped to a given level of the levels, the given level being successively lower than the highest-ranked level for each second distribution.
该连续方式可以包括对来自该至少一个每模式NFA中的给定每模式NFA的多个节点中的节点进行分布，这些节点表示生成该给定每模式NFA所针对的给定正则表达式模式的给定数量的连续元素。进一步地，根据在此披露的实施例，每次至少一次第二分布可以包括分布至少一个下一个节点，该至少一个下一个节点通过与至少一个之前的节点相关联的元数据中所包括的下一个节点地址来标识，该至少一个之前的节点在紧接着的之前分布中被分布。The sequential manner may include distributing nodes from a plurality of nodes of a given per-pattern NFA of the at least one per-pattern NFA, the nodes representing a given number of consecutive elements of a given regular expression pattern for which the given per-pattern NFA was generated. Further, according to embodiments disclosed herein, each of the at least one second distribution may include distributing at least one next node identified by a next node address included in metadata associated with at least one previous node, the at least one previous node having been distributed in an immediately preceding distribution.
该方法可以开始(1102)并且将给定层级设置成存储器层次中排名最高的层级(1104)。该方法可以将给定每模式NFA设置成从一个或多个正则表达式模式的集合中生成的至少一个NFA中的第一每模式NFA(1106)。该方法可以检查该给定每模式NFA的未分布的节点的数量(1108)。如果给定每模式NFA的未分布的节点的数量为零,则该方法可以检查该给定每模式NFA是否是从一个或多个正则表达式模式的集合生成的最后一个NFA(1116)。The method may start (1102) and set a given level to the highest-ranked level in a memory hierarchy (1104). The method may set a given per-pattern NFA to be the first per-pattern NFA in at least one NFA generated from a set of one or more regular expression patterns (1106). The method may check the number of undistributed nodes of the given per-pattern NFA (1108). If the number of undistributed nodes of the given per-pattern NFA is zero, the method may check whether the given per-pattern NFA is the last NFA generated from the set of one or more regular expression patterns (1116).
在该示例实施例中，如果该给定每模式NFA为所生成的最后一个每模式NFA，则该方法可以检查给定层级是否是排名最低的层级(1120)，并且如果该给定层级为排名最低的层级，则之后，该方法结束(1126)。然而，如果对该给定层级是否为排名最低的层级的检查(1120)为否，则该方法可以将该给定层级设置成下一个连续更低的层级(1124)并且再次将该给定每模式NFA设置成从一个或多个正则表达式模式的集合生成的至少一个NFA的第一每模式NFA(1106)并继续对该给定每模式NFA的未分布的节点的数量进行检查(1108)。如果该给定每模式NFA的未分布的节点的数量为零，则该方法可以如以上所披露的继续进行。In this example embodiment, if the given per-pattern NFA is the last per-pattern NFA generated, the method may check whether the given level is the lowest-ranked level (1120), and if the given level is the lowest-ranked level, the method then ends (1126). However, if the check (1120) as to whether the given level is the lowest-ranked level is false, the method may set the given level to the next successively lower level (1124), again set the given per-pattern NFA to the first per-pattern NFA of the at least one NFA generated from the set of one or more regular expression patterns (1106), and continue checking the number of undistributed nodes of the given per-pattern NFA (1108). If the number of undistributed nodes of the given per-pattern NFA is zero, the method may continue as disclosed above.
如果对该给定每模式NFA的未分布的节点的数量的检查(1108)为非零,则该方法可以检查该给定层级是否为排名最低的层级(1110)。如果是,则该方法可以将该数量的未分布的节点分布到被映射到给定层级的给定存储器(1114),并且该方法可以检查该给定每模式NFA是否是从一个或多个正则表达式模式的集合生成的最后一个NFA(1116)。如果是,则该方法可以如以上所披露的继续进行。如果不是,则该方法可以将该给定每模式NFA设置成所生成的下一个每模式NFA(1118),并且该方法可以迭代再次检查该给定每模式NFA的未分布的节点的数量(1108),该给定每模式NFA被更新成为所生成的下一个每模式NFA。If the check (1108) of the number of undistributed nodes for the given per-pattern NFA is non-zero, the method may check whether the given level is the lowest-ranked level (1110). If so, the method may distribute the number of undistributed nodes to a given memory mapped to the given level (1114), and the method may check whether the given per-pattern NFA is the last NFA generated from the set of one or more regular expression patterns (1116). If so, the method may continue as disclosed above. If not, the method may set the given per-pattern NFA to the next per-pattern NFA generated (1118), and the method may iteratively check the number of undistributed nodes for the given per-pattern NFA again (1108), with the given per-pattern NFA being updated to become the next per-pattern NFA generated.
如果对该给定层级是否为排名最低的层级的检查(1110)为否,则该方法可以检查该给定每模式NFA的未分布的节点的数量是否超过被配置成用于该给定层级的每模式NFA存储分配设置所表示的节点数量(1112)。如果超过,则该方法可以对被配置成用于该给定层级的每模式NFA存储分配设置所表示的该数量的未分布的节点进行分布以便存储在被映射到给定层级的给定存储器(1122),并且检查该给定每模式NFA是否是从一个或多个正则表达式模式的集合生成的最后一个NFA(1116)。如果是,则该方法可以如以上所披露的继续进行。If the check (1110) of whether the given level is the lowest-ranked level is false, the method may check whether the number of undistributed nodes of the given per-pattern NFA exceeds the number of nodes indicated by the per-pattern NFA storage allocation setting configured for the given level (1112). If so, the method may distribute the number of undistributed nodes indicated by the per-pattern NFA storage allocation setting configured for the given level for storage in a given memory mapped to the given level (1122), and check whether the given per-pattern NFA is the last NFA generated from the set of one or more regular expression patterns (1116). If so, the method may continue as disclosed above.
如果对该给定每模式NFA是否为所生成的最后一个每模式NFA的检查(1116)为否,则该方法可以将该给定每模式NFA设置成所生成的下一个每模式NFA(1118),并且该方法可以迭代再次检查该给定每模式NFA的未分布的节点的数量(1108),该给定每模式NFA被更新成为所生成的下一个每模式NFA。If the check (1116) as to whether the given per-pattern NFA is the last per-pattern NFA generated is false, the method may set the given per-pattern NFA to be the next per-pattern NFA generated (1118), and the method may iteratively check the number of undistributed nodes of the given per-pattern NFA again (1108), the given per-pattern NFA being updated to be the next per-pattern NFA generated.
然而，如果对该给定每模式NFA的未分布的节点的数量是否超过被配置成用于该给定层级的每模式NFA存储分配设置所表示的节点数量的检查(1112)为否，则该方法可以将该数量的未分布的节点分布到被映射到给定层级的给定存储器(1114)并且如以上披露的继续进行。However, if the check (1112) as to whether the number of undistributed nodes of the given per-pattern NFA exceeds the number of nodes represented by the per-pattern NFA storage allocation setting configured for the given level is false, the method may distribute that number of undistributed nodes to the given memory mapped to the given level (1114) and proceed as disclosed above.
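The distribution loop of FIG. 11 can be sketched as follows. The sketch assumes absolute per-level allocation settings, with `None` marking the lowest-ranked level, which accepts all remaining undistributed nodes; the function name and the numeric values are illustrative assumptions, not from the source.

```python
def distribute(nfa_node_counts, allocation_per_level):
    """Sketch of the FIG. 11 flow: visit levels from highest-ranked to
    lowest-ranked; at each level, give each per-pattern NFA up to that
    level's allocation setting; the last setting must be None (lowest
    level, unlimited), so no node is left undistributed."""
    remaining = list(nfa_node_counts)
    plan = []  # plan[level][nfa] = nodes stored at that level for that NFA
    for setting in allocation_per_level:
        row = []
        for i, count in enumerate(remaining):
            take = count if setting is None else min(count, setting)
            row.append(take)
            remaining[i] -= take
        plan.append(row)
    return plan
```

For example, with NFAs of 12, 3, and 25 nodes and a highest-level setting of 10, the 3-node NFA fits entirely in the highest-ranked level (its lower-level portion is empty, mirroring the empty fourth portion 806b described above), while the rest spills to the lowest-ranked level.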
根据在此披露的实施例，每模式NFA存储分配设置可以通过绝对值表示唯一节点的目标数量。该绝对值可以是每个对应节点集合的共同值，该共同值使每个对应节点集合能够具有用于存储在被映射到对应层级的给定存储器内的唯一节点的目标数量的相同值。例如，如图10中所示，每模式NFA 1014a-c中的每个具有所选择的第一部分1004，该第一部分表示来自每模式NFA 1014a-c中的每个的有待分布到被映射到层级1008a的存储器1056a的相同数量的节点，每模式NFA存储分配设置1010a被配置成用于该层级。According to embodiments disclosed herein, a per-pattern NFA storage allocation setting can represent a target number of unique nodes by an absolute value. This absolute value can be a common value for each corresponding set of nodes, which enables each corresponding set of nodes to have the same value for the target number of unique nodes to be stored within a given memory mapped to a corresponding level. For example, as shown in FIG. 10, each of the per-pattern NFAs 1014a-c has a selected first portion 1004 that represents the same number of nodes, from each of the per-pattern NFAs 1014a-c, to be distributed to the memory 1056a mapped to the level 1008a, for which the per-pattern NFA storage allocation setting 1010a is configured.
可替代地，可以通过应用于每个对应节点集合中的节点的对应总数的百分比值来表示唯一节点的目标数量，该百分比值使每个对应节点集合能够具有用于存储在被映射到对应层级的给定存储器内的唯一节点的目标数量的单独值。例如，如果如25%的数量被配置成用于每模式NFA存储分配设置1010a(其被配置成用于层级1008a)，则第一部分1004将包括来自每模式NFA 1014a-c的节点的25%。由于每个每模式NFA 1014a-c的节点总数可以不同，来自每模式NFA 1014a-c中的每个的节点的数量可以不同。Alternatively, the target number of unique nodes can be represented by a percentage value applied to the corresponding total number of nodes in each corresponding node set, which enables each corresponding node set to have a separate value for the target number of unique nodes stored within a given memory mapped to the corresponding level. For example, if a number such as 25% is configured for per-pattern NFA storage allocation setting 1010a (which is configured for level 1008a), then first portion 1004 will include 25% of the nodes from per-pattern NFAs 1014a-c. Since the total number of nodes of each per-pattern NFA 1014a-c can be different, the number of nodes from each of per-pattern NFAs 1014a-c can be different.
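The two allocation-setting styles (a common absolute value versus a percentage of each NFA's own node total) can be contrasted in a short sketch; the function name and the `(kind, value)` tuple encoding are illustrative assumptions.

```python
def target_nodes(total_nodes: int, setting) -> int:
    """Target number of a per-pattern NFA's nodes for one level.

    ("absolute", n): same cap n for every NFA (common value).
    ("percent", p):  p% of this NFA's own node total (separate value).
    """
    kind, value = setting
    if kind == "absolute":
        return min(total_nodes, value)
    if kind == "percent":
        return (total_nodes * value) // 100
    raise ValueError(kind)
```

Under a 25% setting, a 40-node NFA contributes 10 nodes to the level while an 8-node NFA contributes 2, whereas an absolute setting of 10 would cap both at 10 (or at their full size, if smaller).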
每模式NFA存储分配设置可以包括一个第一每模式NFA存储分配设置和一个第二每模式NFA存储分配设置。层级可以包括一个排名最高的层级和一个排名第二高的层级。该第一每模式NFA存储分配设置可以被配置成用于排名最高的层级。该第二每模式NFA存储分配设置可以被配置成用于排名第二高的层级。该第一每模式NFA存储分配设置可以小于该第二每模式NFA存储分配设置。例如，所表示的来自每个每模式NFA的用于分布到性能最高的存储器的节点的数量可以小于表示用于性能最低的存储器(如系统存储器，其可以表示无限数量的节点)的节点的数量。The per-pattern NFA storage allocation settings may include a first per-pattern NFA storage allocation setting and a second per-pattern NFA storage allocation setting. The levels may include a highest-ranked level and a second-highest-ranked level. The first per-pattern NFA storage allocation setting may be configured for the highest-ranked level. The second per-pattern NFA storage allocation setting may be configured for the second-highest-ranked level. The first per-pattern NFA storage allocation setting may be less than the second per-pattern NFA storage allocation setting. For example, the number of nodes represented from each per-pattern NFA for distribution to the highest-performance memory may be less than the number of nodes represented for the lowest-performance memory (e.g., system memory), which may represent an effectively unlimited number of nodes.
在此披露的实施例可以使给定分布中的节点的数量最大化，并且最大化的数量会受到被配置成用于给定层级的每模式NFA存储分配设置中的对应每模式NFA存储分配设置的限制。例如，每模式NFA存储分配设置所表示的节点的数量可以为十。如此，每个包括十个或更多未分布的节点的每模式NFA将分布十个节点。每个包括少于十个未分布的节点的每模式NFA将分布其未分布数量中对应数量的节点。Embodiments disclosed herein can maximize the number of nodes in a given distribution, with the maximized number being limited by the corresponding per-pattern NFA storage allocation setting of the per-pattern NFA storage allocation settings configured for the given level. For example, the number of nodes represented by the per-pattern NFA storage allocation setting can be ten. As such, each per-pattern NFA that includes ten or more undistributed nodes will distribute ten nodes. Each per-pattern NFA that includes fewer than ten undistributed nodes will distribute its corresponding, smaller number of undistributed nodes.
如以上所披露的，行走器(如图3A的行走器320)可以被配置成用于使输入流的有效载荷的段走过统一DFA(如图3A的统一DFA 312)和至少一个每模式NFA(如图3A的每模式NFA 314)的节点，以便试图对输入流中的正则表达式模式进行匹配。在编译阶段，编译器(如图3A的编译器306)可以生成统一DFA 312和至少一个每模式NFA 314。统一DFA 312和至少一个每模式NFA 314的节点可以存储在存储器层次中的多个存储器内，如图7A的存储器层次743中的多个存储器756a-c。As disclosed above, a walker (such as the walker 320 of FIG. 3A) can be configured to walk segments of a payload of an input stream through nodes of a unified DFA (such as the unified DFA 312 of FIG. 3A) and of at least one per-pattern NFA (such as the per-pattern NFA 314 of FIG. 3A) in an attempt to match regular expression patterns in the input stream. During the compilation stage, a compiler (such as the compiler 306 of FIG. 3A) can generate the unified DFA 312 and the at least one per-pattern NFA 314. The nodes of the unified DFA 312 and of the at least one per-pattern NFA 314 can be stored in a plurality of memories of a memory hierarchy, such as the plurality of memories 756a-c of the memory hierarchy 743 of FIG. 7A.
如以上所披露的，参照图10和图11，编译器306生成的每个每模式NFA的对应节点集合可以基于编译器306为每个对应集合确定的节点分布而分布和存储在多个存储器756a-c中的一个或多个存储器之中。如以上所披露的，编译器306可以根据被映射到多个存储器756a-c的层级(如图7A的层级708a-c)和被配置成用于层级708a-c的每模式NFA存储分配设置(如710a-c)来确定每个节点分布。As disclosed above with reference to FIG. 10 and FIG. 11, the corresponding node sets of each per-pattern NFA generated by compiler 306 can be distributed and stored in one or more of the plurality of memories 756a-c based on the node distribution determined for each corresponding set by compiler 306. As disclosed above, compiler 306 can determine each node distribution based on the levels (e.g., levels 708a-c of FIG. 7A) mapped to the plurality of memories 756a-c and the per-pattern NFA storage allocation settings (e.g., 710a-c) configured for the levels 708a-c.
如此,行走器320可以被配置成用于行走每模式NFA 314的对应节点集合中的节点,这些节点可以基于编译器306所确定的节点分布根据被映射到多个存储器756a-c的层级708a-c和被配置成用于层级708a-c的每模式NFA存储分配设置710a-c而被分布和存储在多个存储器756a-c中的一个或多个存储器之中。如以上参照图6所披露的,行走器320可以被配置成用于基于如在统一DFA 312的行走过程中行走器320确定的输入流中的对应正则表达式模式的部分匹配而行走每模式NFA 314的对应节点集合。Thus, the walker 320 can be configured to walk the nodes in the corresponding node set of the per-pattern NFA 314, which can be distributed and stored in one or more of the plurality of memories 756a-c according to the levels 708a-c mapped to the plurality of memories 756a-c and the per-pattern NFA storage allocation settings 710a-c configured for the levels 708a-c based on the node distribution determined by the compiler 306. As disclosed above with reference to FIG6, the walker 320 can be configured to walk the corresponding node set of the per-pattern NFA 314 based on partial matches of corresponding regular expression patterns in the input stream as determined by the walker 320 during the walking of the unified DFA 312.
图12为可以在至少一个处理器中实施的方法的另一个示例实施例的流程图1200，该处理器操作性地耦合至被映射到安全装置内的存储器层次中的多个层级的存储器，该安全装置操作性地耦合至网络。该方法可以开始(1202)并根据输入流的有效载荷的段行走针对对应正则表达式模式生成的至少一个每模式NFA中的给定每模式NFA的对应节点集合中的节点。该对应节点集合可以基于根据被映射到多个存储器的层级和被配置成用于这些层级的每模式NFA存储分配设置而确定的节点分布，被分布和存储在该多个存储器中的一个或多个存储器之中(1204)。之后，在该示例实施例中，该方法结束(1206)。FIG. 12 is a flowchart 1200 of another example embodiment of a method that can be implemented in at least one processor operatively coupled to memories mapped to a plurality of levels of a memory hierarchy within a security device, the security device being operatively coupled to a network. The method can begin (1202) and walk, based on segments of a payload of an input stream, nodes in a corresponding set of nodes of a given per-pattern NFA of at least one per-pattern NFA generated for a corresponding regular expression pattern. The corresponding set of nodes can be distributed and stored in one or more of the plurality of memories based on a node distribution determined from the levels mapped to the plurality of memories and the per-pattern NFA storage allocation settings configured for the levels (1204). Thereafter, in this example embodiment, the method ends (1206).
行走器320可以被配置成用于基于(i)有效载荷的给定段在给定节点处的肯定匹配和(ii)与该给定节点相关联的下一个节点地址来从对应节点集合中的该给定节点行走到下一个节点。该下一个节点地址可以被配置成用于标识该下一个节点和多个存储器(如图7A的多个存储器756a-c)中的其中存储该下一个节点的给定存储器。例如，转到图5A的示例实施例，行走器320可以基于节点N2 510处的段522c的肯定匹配来行走节点N4 514，因为节点N2 510可以被配置成用于将有效载荷中的给定偏移处的给定段与字符元素‘a’匹配。与节点N2 510相关联的元数据(未示出)可以标识下一个节点(如节点N4 514)以便基于该给定偏移处的给定段与字符元素‘a’的肯定匹配来进行遍历(即，行走)。The walker 320 can be configured to walk from the given node in the corresponding node set to a next node based on (i) a positive match, at the given node, of a given segment of the payload and (ii) a next node address associated with the given node. The next node address can be configured to identify both the next node and the given memory, of the plurality of memories (such as the plurality of memories 756a-c of FIG. 7A), in which the next node is stored. For example, turning to the example embodiment of FIG. 5A, the walker 320 can walk node N4 514 based on a positive match of the segment 522c at node N2 510, because node N2 510 can be configured to match a given segment at a given offset in the payload with the character element 'a'. The metadata (not shown) associated with node N2 510 can identify the next node (such as node N4 514) to be traversed (i.e., walked) based on the positive match of the given segment at the given offset with the character element 'a'.
例如，与节点N2 510相关联的元数据可以包括下一个节点地址，该地址为节点N4 514的地址或指针或索引或对下一个节点N4 514进行标识的任何其他合适的标识符，以便基于节点N2 510处的肯定匹配来进行遍历。与节点N2 510相关联的元数据可以进一步对多个存储器中的存储下一个节点N4 514的给定存储器进行标识。可以用任何合适的方式对该给定存储器进行标识，如通过结合下一个节点N4 514的下一个节点地址(未示出)或作为其一部分来配置具体位。如此，行走器320可以被配置成用于从通过与给定节点N2 510相关联的下一个节点地址所标识的给定存储器提取下一个节点N4 514，以便根据下一个偏移处的下一个段(如图5A的下一个偏移520d处的下一个段522d)行走下一个节点N4 514。For example, the metadata associated with node N2 510 may include a next node address that is an address of node N4 514, or a pointer, index, or any other suitable identifier that identifies the next node N4 514 to be traversed based on the positive match at node N2 510. The metadata associated with node N2 510 may further identify the given memory of the plurality of memories that stores the next node N4 514. The given memory may be identified in any suitable manner, such as by configuring specific bits in conjunction with, or as part of, the next node address (not shown) of the next node N4 514. As such, the walker 320 may be configured to fetch the next node N4 514 from the given memory identified by the next node address associated with the given node N2 510, so as to walk the next node N4 514 with the next segment at the next offset (e.g., the next segment 522d at the next offset 520d of FIG. 5A).
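One walker transition of the kind described above can be sketched as follows. The node-table layout, the `(memory_id, index)` keys, and the character values are illustrative assumptions and do not reproduce the disclosed node format.

```python
# Hypothetical node table: each node carries the character element it
# matches and a next-node reference that names both the memory holding
# the next node and its index within that memory.
NODES = {
    # (memory_id, index): {"match": char, "next_node": (memory_id, index)}
    (0, 2): {"match": "a", "next_node": (0, 4)},   # e.g., node N2 -> N4
    (0, 4): {"match": "b", "next_node": (0, 5)},
}

def walk_step(node_key, segment):
    """On a positive match of the current segment at the node, return the
    (memory_id, index) of the next node to fetch; otherwise return None."""
    node = NODES[node_key]
    if segment == node["match"]:
        return node["next_node"]   # fetch next node from the named memory
    return None                    # negative match: transition not taken
```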
下一个节点N4 514可以被高速缓存在节点高速缓存中。回到图4A，HPU 425的示例实施例包括可以操作性地耦合至HNA处理内核408的节点高速缓存451。节点高速缓存451的大小可以用于至少存储阈值数量的节点。如此，HNA处理内核408可以将一个或多个节点(高达阈值数量的节点)高速缓存在节点高速缓存451中。如以上所披露的，HNA处理内核408可以被配置成用于实施行走器320的关于NFA处理的各方面。如此，行走器320可以基于下一个节点N4 514的提取(即，读取访问)是否引起高速缓存缺失而从节点高速缓存451或该多个存储器756a-c中的给定存储器取回下一个节点N4 514。根据在此披露的实施例，可以基于循环或最近最少使用(LRU)替换策略来替换节点高速缓存451中的条目。行走器320可以被配置成用于保持节点高速缓存451中的一个或多个条目的索引以便在实施循环或LRU替换策略中使用。The next node N4 514 can be cached in a node cache. Returning to FIG. 4A, the example embodiment of the HPU 425 includes a node cache 451 that can be operatively coupled to the HNA processing core 408. The node cache 451 can be sized to store at least a threshold number of nodes. As such, the HNA processing core 408 can cache one or more nodes (up to the threshold number of nodes) in the node cache 451. As disclosed above, the HNA processing core 408 can be configured to implement the aspects of the walker 320 that relate to NFA processing. As such, the walker 320 can retrieve the next node N4 514 from the node cache 451 or from a given memory of the plurality of memories 756a-c, based on whether the fetch (i.e., read access) of the next node N4 514 results in a cache miss. According to embodiments disclosed herein, entries in the node cache 451 can be replaced based on a round-robin or least recently used (LRU) replacement policy. The walker 320 can be configured to maintain an index of one or more entries in the node cache 451 for use in implementing the round-robin or LRU replacement policy.
如果节点N4 514的提取引起高速缓存缺失，则HNA处理内核408可以从静态存储节点N4 514的给定存储器中提取节点N4 514并且还将节点N4 514高速缓存在节点高速缓存451中。基于与给定存储器的层级相关联的层次节点事务大小，HNA处理内核408可以高速缓存来自给定存储器的附加节点。节点N4 514和所高速缓存的任何附加节点可以在对应每模式NFA内以连续的方式安排。例如，基于与给定存储器的层级相关联的层次节点事务大小，HNA处理内核408可以将节点N5 515与节点N4 514一起高速缓存，这两个节点在每模式NFA 504内以连续的方式安排。If the fetch of node N4 514 results in a cache miss, the HNA processing core 408 can fetch node N4 514 from the given memory that statically stores node N4 514 and also cache node N4 514 in the node cache 451. Based on the level node transaction size associated with the level of the given memory, the HNA processing core 408 can cache additional nodes from the given memory. Node N4 514 and any additional cached nodes can be arranged in a contiguous manner within the corresponding per-pattern NFA. For example, based on the level node transaction size associated with the level of the given memory, the HNA processing core 408 can cache node N5 515 together with node N4 514, the two nodes being arranged in a contiguous manner within the per-pattern NFA 504.
根据在此披露的实施例，对应的层次节点事务大小(未示出)可以与层级708a-c中的每个相关联。每个对应的层次节点事务大小可以表示从映射到对应层级的给定存储器提取的用于对给定存储器的读取访问的最大节点数量。例如，与排名最高的层级相关联的层次节点事务大小可以表示一个或两个节点的最大节点数量。根据在此披露的实施例，层级中排名最高的层级可以与和这些层级相关联的层次节点事务大小中的最小的层次节点事务大小相关联。According to embodiments disclosed herein, a corresponding level node transaction size (not shown) can be associated with each of the levels 708a-c. Each corresponding level node transaction size can represent the maximum number of nodes that can be fetched from a given memory mapped to the corresponding level for a read access to the given memory. For example, the level node transaction size associated with the highest-ranked level can represent a maximum of one or two nodes. According to embodiments disclosed herein, the highest-ranked level of the levels can be associated with the smallest of the level node transaction sizes associated with the levels.
可以用任何合适的方式表示层次节点事务大小，如通过直接指定最大节点数量，或通过指定可以是所表示的最大节点数量的大小的倍数的位数量。根据在此披露的实施例，节点高速缓存451可以被组织成多行。可以基于节点位大小确定每行的大小，并且每行可以包括供HNA处理内核408使用的附加位。每行可以是来自该多个存储器中的每个存储器的事务的最小量子(即，粒度)。The level node transaction size can be expressed in any suitable manner, such as by directly specifying the maximum number of nodes, or by specifying a number of bits that can be a multiple of the size of the maximum number of nodes represented. According to embodiments disclosed herein, the node cache 451 can be organized into multiple rows. The size of each row can be determined based on the node bit size, and each row can include additional bits for use by the HNA processing core 408. Each row can be the smallest quantum (i.e., granularity) of a transaction from each of the plurality of memories.
根据在此披露的实施例,排名最高的存储器可以是与HNA处理内核408共同位于芯片上的存储器。排名最高的存储器可以是相对于该多个存储器中的其他存储器性能最高的存储器。性能最高的存储器可以具有最快的读取和写入访问时间。事务大小(例如,从性能最高的存储器读取的数据的量子大小)可以是一行或两行,该一行或两行可以分别包括一个或两个节点。According to embodiments disclosed herein, the highest-ranked memory can be a memory co-located on-chip with the HNA processing core 408. The highest-ranked memory can be the memory with the highest performance relative to the other memories in the plurality of memories. The highest-performance memory can have the fastest read and write access times. The transaction size (e.g., the quantum size of data read from the highest-performance memory) can be one or two rows, which can include one or two nodes, respectively.
相比之下，排名最低的层级可以被映射到该多个存储器中的性能最低的存储器。性能最低的存储器可以是与该多个存储器中的其他存储器相比具有相对更长的读取和写入访问时间的性能最慢的存储器。例如，性能最慢的存储器可以是最大的存储器，如不与HNA处理内核408一起位于芯片上的外部存储器。如此，可以通过每读取访问具有更大的事务大小(如四行)来减少对此类存储器的读取访问的数量。In contrast, the lowest-ranked level can be mapped to the lowest-performing memory of the plurality of memories. The lowest-performing memory can be the slowest memory, having relatively longer read and write access times than the other memories of the plurality of memories. For example, the slowest memory can be the largest memory, such as an external memory that is not located on-chip with the HNA processing core 408. As such, the number of read accesses to such a memory can be reduced by having a larger transaction size (e.g., four rows) per read access.
根据在此披露的实施例，与排名最低的层级相关联的层次节点事务大小可以被配置成使得节点高速缓存451中的一行或多行被逐出并用从被映射到排名最低的层级的对应存储器中提取的一行或多行替换。可以将该一行或多行确定为存储阈值数量的节点的一行或多行。如此，如果对应的层级为层级中排名最低的层级，则对应的层次节点事务大小可以使HNA处理内核408能够高速缓存来自给定存储器的阈值数量的节点。如此，如果对应的层级为层级中排名最低的层级，HNA处理内核408可以被配置成用于逐出高速缓存在节点高速缓存451中的阈值数量的节点。According to embodiments disclosed herein, the level node transaction size associated with the lowest-ranked level can be configured such that one or more rows of the node cache 451 are evicted and replaced with one or more rows fetched from the corresponding memory mapped to the lowest-ranked level. The one or more rows can be determined as the row or rows that store the threshold number of nodes. As such, if the corresponding level is the lowest-ranked level of the levels, the corresponding level node transaction size can enable the HNA processing core 408 to cache the threshold number of nodes from the given memory. As such, if the corresponding level is the lowest-ranked level of the levels, the HNA processing core 408 can be configured to evict the threshold number of nodes cached in the node cache 451.
根据在此披露的实施例，节点高速缓存451可以被配置成用于高速缓存阈值数量的节点。节点的阈值数量可以是基于与该多个存储器相关联的所有事务大小中的最大事务大小而可以被读取的最大节点数量。例如，该多个存储器的所有事务大小中的最大事务大小可以是与排名最低的层级相关联的给定事务大小，该排名最低的层级可以被映射到例如不与HNA处理内核408共同位于芯片上的外部存储器。According to embodiments disclosed herein, the node cache 451 can be configured to cache a threshold number of nodes. The threshold number of nodes can be the maximum number of nodes that can be read based on the maximum transaction size among all transaction sizes associated with the plurality of memories. For example, the maximum transaction size among all transaction sizes of the plurality of memories can be the given transaction size associated with the lowest-ranked level, which can be mapped to, for example, an external memory that is not co-located on-chip with the HNA processing core 408.
将该一个或多个节点高速缓存在节点高速缓存451中可以基于从该多个存储器中的给定存储器读取的一个或多个节点中的给定节点的高速缓存缺失和与层级中的被映射到给定存储器的对应层级相关联的对应的层次节点事务大小。与对应层级相关联的对应的层次节点事务大小可以表示被从映射到对应层级的给定存储器提取的用于对给定存储器的读取访问的最大节点数量。Caching the one or more nodes in the node cache 451 may be based on a cache miss of a given node in the one or more nodes read from a given memory in the plurality of memories and a corresponding level node transaction size associated with a corresponding level in the hierarchy mapped to the given memory. The corresponding level node transaction size associated with the corresponding level may represent a maximum number of nodes that may be fetched from the given memory mapped to the corresponding level for a read access to the given memory.
HNA处理内核408可以被配置成用于使用LRU或循环替换策略来从节点高速缓存451中逐出所高速缓存的一个或多个节点。根据在此披露的实施例，如果被映射到给定存储器的对应层级高于层级中的排名最低的层级，则可以基于该层级确定所逐出的一个或多个所高速缓存的节点的总数。例如，如果该层级与为1的层次节点事务大小相关联，则从节点高速缓存逐出的所高速缓存的节点的总数为1，并且可以基于LRU或循环替换策略确定所逐出的条目。总数为1是出于说明性目的，并且应理解到，可以使用任何合适的层次节点事务大小。The HNA processing core 408 can be configured to evict one or more cached nodes from the node cache 451 using an LRU or round-robin replacement policy. According to embodiments disclosed herein, if the corresponding level mapped to the given memory is higher than the lowest-ranked level of the levels, the total number of cached nodes evicted can be determined based on that level. For example, if the level is associated with a level node transaction size of one, the total number of cached nodes evicted from the node cache is one, and the evicted entry can be determined based on the LRU or round-robin replacement policy. A total of one is used for illustrative purposes, and it should be understood that any suitable level node transaction size can be used.
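The miss-fetch-evict behavior of the node cache described in the preceding paragraphs can be sketched with an LRU policy as follows; the class shape, the capacity, and the transaction sizes are illustrative assumptions rather than the disclosed hardware design.

```python
from collections import OrderedDict

class NodeCache:
    """Sketch of a node cache: on a miss, fetch a run of consecutive nodes
    sized by the level node transaction size of the node's memory, evicting
    least-recently-used entries to make room."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # node index -> node data, LRU order

    def fetch(self, index: int, memory, txn_size: int):
        if index in self.entries:              # cache hit
            self.entries.move_to_end(index)
            return self.entries[index]
        # Cache miss: read up to txn_size consecutive nodes from memory.
        for i in range(index, min(index + txn_size, len(memory))):
            if i not in self.entries:
                while len(self.entries) >= self.capacity:
                    self.entries.popitem(last=False)   # evict LRU entry
                self.entries[i] = memory[i]
        self.entries.move_to_end(index)
        return self.entries[index]
```

A fetch from a memory with transaction size two brings in the requested node plus its contiguous successor, mirroring the N4/N5 example above; a larger transaction size for a slower memory evicts and replaces correspondingly more entries.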
图13A为可以在至少一个处理器中实施的方法的示例实施例的流程图1300，该处理器操作性地耦合至存储器层次中的多个存储器和安全装置内的节点高速缓存，该安全装置操作性地耦合至网络。该方法可以开始(1302)并将至少一个有限自动机的多个节点存储在多个存储器中(1304)。该方法可以基于一个或多个节点中的给定节点的高速缓存缺失和与层级相关联的层次节点事务大小，将该多个节点中的存储在位于存储器层次中的层级处的给定存储器内的一个或多个节点高速缓存在节点高速缓存内(1306)。之后，在该示例实施例中，该方法结束(1308)。FIG. 13A is a flowchart 1300 of an example embodiment of a method that can be implemented in at least one processor operatively coupled to a plurality of memories in a memory hierarchy and to a node cache within a security device, the security device being operatively coupled to a network. The method can begin (1302) and store a plurality of nodes of at least one finite automaton in the plurality of memories (1304). The method can cache, in the node cache, one or more nodes of the plurality of nodes that are stored in a given memory of the plurality of memories at a level of the memory hierarchy, based on a cache miss for a given node of the one or more nodes and the level node transaction size associated with the level (1306). Thereafter, in this example embodiment, the method ends (1308).
图13B为有效载荷1342和有效载荷1342中带有对应偏移1318的段1316的示例实施例的框图1341。在示例实施例中，可以根据图13B的有效载荷1342行走图5A的每模式NFA图形504中的节点。例如，行走器320可以试图对有效载荷1342的在每模式NFA图形504中的节点处的段1316进行匹配以试图将有效载荷1342与图5A的正则表达式模式502匹配。FIG. 13B is a block diagram 1341 of an example embodiment of a payload 1342 and segments 1316 in the payload 1342 with corresponding offsets 1318. In an example embodiment, the nodes in the per-pattern NFA graph 504 of FIG. 5A can be walked based on the payload 1342 of FIG. 13B. For example, the walker 320 can attempt to match the segments 1316 of the payload 1342 at the nodes in the per-pattern NFA graph 504 in an attempt to match the payload 1342 to the regular expression pattern 502 of FIG. 5A.
The plurality of nodes of the per-pattern NFA graph 504 may be stored in a plurality of memories, such as the memories 756a-c of FIG. 7A. One or more nodes of the plurality (e.g., nodes N0 506, N1 508, N2 510, and N3 512 of the per-pattern NFA graph 504) may be stored in a given memory (e.g., the highest-performance memory 756a of FIG. 7A) located at a level (e.g., the highest-ranked level 708a) of a memory hierarchy (e.g., the memory hierarchy 743). As disclosed below with reference to FIG. 13C and FIG. 13D, nodes N0 506, N1 508, N2 510, and N3 512 may be cached in a node cache (e.g., the node cache 451 of FIG. 4A) based on a cache miss for a given node (e.g., node N0 506) and the hierarchical node transaction size 723a associated with the level 708a.
As shown in FIG. 13B, the payload 1342 includes segments 1322a-n (i.e., h, y, x, etc.) with corresponding offsets 1320a-n (i.e., 0, 1, 2, etc.). The walker 320 may walk the segments 1322a-n of the payload 1342 through the NFA graph 504 one segment at a time to match the regular expression pattern 502 against the input stream. The given segment of the segments 1322a-n used to walk a given node may be determined based on which of the corresponding offsets 1320a-n is the current offset within the payload 1342. As disclosed above with reference to FIG. 5A, the walker 320 may update the current offset by incrementing or decrementing it. The walker 320 may be configured to select the upper ε-path 530a upon traversing the split node N1 508, because the upper ε-path 530a represents the lazy path.
FIG. 13C is a table 1338a of an example embodiment of processing cycles of walking the per-pattern NFA graph 504 of FIG. 5A with the payload of FIG. 13B by selecting the lazy path at the split node N1 508.
FIG. 13D is a table 1338b that is a continuation of the table 1338a of FIG. 13C. As shown in the tables 1338a and 1338b, each of the processing cycles 1340a-mm may include walking the current node 1330 with the segment at the current offset 1332 to determine a match result 1334, and a walker action 1336 based on the match result 1334. In the example embodiment, the walker 320 may walk the starting node N0 506 with the segment 1322a (i.e., "h") at the current offset 1320a in the processing cycle 1340a. As disclosed above with reference to FIG. 6, the starting node N0 506 and the current offset 1320a may be specified based on match results produced by the DFA processing performed by the HFA processor 110.
Because the segment 1322a matches the character "h" at node N0 506 of the per-pattern NFA graph 504, the NFA processing by the HNA processing core 408 yields a determination by the walker 320 that the match result 1334 is a positive match result. As specified by the compiler 306 via metadata (not shown) associated with the starting node N0 506, the walker 320 may walk in a forward direction and fetch the next node indicated by the metadata associated with node N0 506, and may increment the current offset from 1320a (i.e., "0") to 1320b (i.e., "1"). In the example embodiment, the next node indicated by node N0 506 is the split node N1 508. As such, the walker 320 takes the action 1336 for the processing cycle 1340a, which includes updating the current offset to "1" in the payload 1342 and transitioning to the split node N1 508. The transition may include fetching (also referred to herein as loading) the split node N1 508.
Because the split node N1 508 presents multiple transition path options, such as the ε-paths 530a and 530b, the action 1336 for the processing cycle 1340b may include selecting the upper ε-path 530a and fetching node N2 510 independently of the payload 1342, without consuming (i.e., processing) from the payload 1342. Because the split node N1 508 performs no matching function, the current offset/segment 1332 is unchanged, and thus no payload is consumed (i.e., processed) in the processing cycle 1340b.
Because the split node N1 508 presents multiple path options, the action 1336 may include storing unexplored context, such as by storing an indirect or direct identifier of node N3 512 together with the current offset 1320b (i.e., "1"). In the event of a negative match result along the selected partial match path, for example, if a negative match result is determined at node N2 510 or at one or more nodes along the path extending from node N2 510, storing the unexplored context enables the walker 320 to remember to return to node N3 512 and walk node N3 512 with the segment at offset 1320b ("1") of the payload 1342.
In the example embodiment, selection of the ε-transition path 530a results in a match failure being detected at node N2 510 or at a subsequent node of the current thread (e.g., node N4 514). For example, based on selecting the upper path (i.e., the ε-transition path 530a), in the processing cycle 1340c the walker 320 may fetch node N2 510 and try to match the segment 1322b (i.e., "y") at the current offset 1320b (i.e., "1") against the element "a" of node N2 510. Because "y" does not match the element "a" at node N2 510, the action 1336 in the processing cycle 1340c may include popping an entry from the run stack 460 of FIG. 4A.
The popped entry may be the most recently pushed entry, such as, in the example embodiment, the stored entry pushed in the processing cycle 1340b that indicates node N3 512 and the offset 1320b (i.e., "1"). As such, if a match failure is detected, the thread stored for the ε-transition path 530b may be traversed, as is the case shown in the processing cycles 1340d, 1340g, 1340j, 1340m, 1340p, 1340s, 1340w, 1340z, 1340cc, 1340ff, and 1340ii. Storing an untraversed transition path may include storing an entry on a stack, such as the run stack 460 of FIG. 4A, the entry including an identifier of the next node in association with an indication of the current offset.
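The push/pop mechanics described in the preceding paragraphs can be sketched in a few lines of software. This is an illustrative model only: the node table below mimics the shape of a lazy pattern such as "h[^\n]*ab" (the exact pattern 502 and node encodings are defined by the figures, so this layout is an assumption), and the real walker runs in the HNA processing core, not in Python.

```python
# Hypothetical per-pattern NFA resembling "h[^\n]*ab".
# Each entry: ("char", element, next) | ("split", lazy_next, saved_next)
#           | ("not_newline", None, next) | ("match",)
NFA = {
    "N0": ("char", "h", "N1"),
    "N1": ("split", "N2", "N3"),       # lazy: try N2 first, save N3
    "N2": ("char", "a", "N4"),
    "N3": ("not_newline", None, "N1"),
    "N4": ("char", "b", "N5"),
    "N5": ("match",),
}

def walk(payload):
    node, offset = "N0", 0
    run_stack = []                     # unexplored context: (node, offset)
    while True:
        kind = NFA[node][0]
        if kind == "match":
            return True                # final match of the whole pattern
        if kind == "split":
            lazy, saved = NFA[node][1], NFA[node][2]
            run_stack.append((saved, offset))  # push unexplored path
            node = lazy                # no payload consumed at a split
            continue
        # Character-class nodes consume one segment on a positive match.
        ok = (offset < len(payload) and
              (payload[offset] == NFA[node][1]
               if kind == "char" else payload[offset] != "\n"))
        if ok:
            node, offset = NFA[node][2], offset + 1
        elif run_stack:
            node, offset = run_stack.pop()  # backtrack: most recent entry
        else:
            return False
```

Walking "hyxab" through this sketch reproduces the pattern of the tables: each failed try at the "a" node pops the saved (N3, offset) entry, the not-newline node consumes one segment, and control returns to the split node until "ab" finally matches.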
In the processing cycle 1340d, the walker 320 may transition to and walk node N3 512 with the segment "y" located at the offset 1320b of the payload 1342. Because the element associated with node N3 512 indicates a positive match for any segment that is not a newline character, the processing cycle 1340d shows that the match result 1334 is positive. The action 1336 in the processing cycle 1340d may include updating the current offset to the offset 1320c and transitioning back to the split node N1 508, which may be the next node indicated by node N3 512.
Because all arcs transitioning from the split node 508 are ε-transitions, the walker 320 may again select one of the multiple path options without updating the current offset in the processing cycle 1340e and without consuming (i.e., processing) a segment from the payload 1342. In the example embodiment, the walker 320 again selects the ε-transition path 530a. As such, the walker 320 again stores the thread on the run stack 460 by pushing node N3 512 and the current offset, now 1320c (i.e., "2"). As shown in the processing cycle 1340f, the walker 320 fetches node N2 510 and tries to match the segment 1322c (i.e., "x") at the offset 1320c (i.e., "2") against the element "a" of node N2 510.
Because "x" does not match at node N2 510, the walker 320 may again pop an entry from the run stack 460. The popped entry may be the most recently pushed entry, such as, in the example embodiment, the stored entry pushed in the processing cycle 1340e that indicates node N3 512 and the offset 1320c (i.e., "2"). As such, the walker 320 may transition to and again walk node N3 512 in the processing cycle 1340g with the segment "x" located at the offset 1320c of the payload 1342. Because "x" is not a newline character, the processing cycle 1340g shows that the match result 1334 is positive, and the action 1336 in the processing cycle 1340g may include updating the current offset to the offset 1320d (i.e., "3") and transitioning back to the split node N1 508, which may be the next node indicated by the metadata associated with node N3 512.
The walker 320 may continue walking segments of the payload 1342 through the per-pattern NFA 504 as indicated by the subsequent processing cycles 1340i-mm shown in the tables 1338a and 1338b of FIG. 13C and FIG. 13D, respectively, until reaching the marked node N5 515. As shown in the processing cycle 1340mm of the table 1338b, the walker 320 traverses the marked node N5 515, which may be associated with metadata indicating a final (i.e., complete or entire) match of the regular expression pattern 502 in the input stream.
In the example embodiment, walking the payload 1342 through the per-pattern NFA graph 504 may include selecting the lazy path at the split node N1 508 by selecting the upper ε-path 530a, traversing node N2 510, and, on identifying a mismatch at node N2 510, traversing node N3 512. Based on each mismatch at node N2 510, node N3 512 may be traversed again, and so on, until a match is determined at node N2 510. For example, as shown with respect to the processing cycles 1340b-d, 1340e-g, 1340h-j, 1340k-m, 1340n-p, and 1340q-s, the nodes N1 508, N2 510, and N3 512 are traversed with both temporal and spatial locality until a positive match is determined at node N2 510 in the processing cycle 1340u, and likewise, as shown in the processing cycles 1340x-z, 1340aa-cc, 1340dd-ff, and 1340gg-ii, until a positive match is again determined at node N2 510 in the processing cycle 1340kk. Thus, for the majority of the processing cycles of the tables 1338a and 1338b, the walker 320 traverses the nodes N1 508, N2 510, and N3 512 with both temporal and spatial locality.
According to embodiments disclosed herein, using a node cache (such as the node cache 451 of FIG. 4A) while walking segments of the input stream through the finite automaton can further optimize walk performance. For example, as disclosed above with reference to FIG. 7A, the match performance of the walker 320 may be optimized by storing consecutive nodes, such as the nodes N0 506, N1 508, N2 510, and N3 512 of the portion 509 of the per-pattern NFA 504 of FIG. 5A, in a faster-performance memory that may be located at a higher-ranked level relative to another memory that may store the consecutive nodes N4 514 and N5 515.
As disclosed above, earlier nodes (such as the nodes N0 506, N1 508, N2 510, and N3 512 included in the portion 509 of the per-pattern NFA 504 of FIG. 5A) may be stored in the highest-performance memory, which may be located at the highest-ranked level. For example, the nodes N0 506, N1 508, N2 510, and N3 512 included in the portion 509 may be stored in the memory 756a of FIG. 7A, which may be located at the highest-ranked level (such as the level 708a of the memory hierarchy 743). According to embodiments disclosed herein, the nodes N0 506, N1 508, N2 510, and N3 512 included in the portion 509 may be stored in the memory 756a based on a per-pattern NFA storage allocation setting 710a that may be configured for the level 708a.
In the example embodiment, the hierarchical node transaction size associated with the highest-ranked level 708a (such as the hierarchical node transaction size 723a of FIG. 7B) may represent four nodes. For example, a transaction of the hierarchical node transaction size 723a may include reading one or more rows from the memory 756a; that is, data stored at one or more addresses of the memory 756a may be read based on a read access, and four nodes may be read (i.e., retrieved, loaded, or fetched) from the memory 756a. As such, the hierarchical node transaction size 723a represents that four nodes may be read from the memory 756a located at the highest-ranked level 708a when a single read access results in reading four nodes. In general, the number of nodes read per transaction (i.e., per read access) may be determined based on the number of nodes stored per row of a given memory and the number of rows (i.e., addresses) read from the given memory located at a given level. In the example embodiment of FIG. 7B, the memory 756b may be associated with the hierarchical node transaction size 723b, and the memory 756c may be associated with the hierarchical node transaction size 723c.
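The relationship just stated (nodes per transaction equals nodes stored per row times rows read per access) can be expressed directly. The concrete row widths below are assumptions chosen only to reproduce the four-node example; the document does not specify them.

```python
def nodes_per_transaction(nodes_per_row, rows_per_read_access):
    """Number of nodes returned by one read access (one transaction)
    against a memory at a given level of the hierarchy."""
    return nodes_per_row * rows_per_read_access

# Example: a memory whose rows each hold four nodes, read one row per
# access, yields a hierarchical node transaction size of four (as 723a).
assert nodes_per_transaction(nodes_per_row=4, rows_per_read_access=1) == 4
# A narrower memory holding two nodes per row would need two row reads
# per transaction to deliver the same four nodes.
assert nodes_per_transaction(nodes_per_row=2, rows_per_read_access=2) == 4
```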
In the example embodiment, traversing node N0 506 in the processing cycle 1340a causes a cache miss because node N0 506 has not yet been cached in the node cache 451. As a result, because the hierarchical node transaction size 723a represents four nodes in the example embodiment, four nodes, such as the nodes N0 506, N1 508, N2 510, and N3 512, may be brought from the memory 756a into the node cache 451.
As a result, the walker 320 may access the nodes N1 508, N2 510, and N3 512 from the node cache 451 until the processing cycle 1340v, in which, based on the positive match at node N2 510 determined in the processing cycle 1340u, the walker 320 traverses node N4 514 with the segment 1322g (i.e., "q") at the offset 1320g (i.e., "8") of the payload 1342. As such, the node cache 451 may be advantageously employed to further optimize walk performance by caching the nodes of a per-pattern NFA that have temporal and spatial locality relationships within that per-pattern NFA (such as the nodes N1 508, N2 510, and N3 512 in the example embodiment). For an NFA generated from multiple patterns, such temporal and spatial locality relationships among nodes would not exist within a per-pattern NFA. Because the embodiments disclosed herein may be based on NFAs generated as per-pattern NFAs, the optimization enabled by the node cache 451 is available.
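A rough way to see the benefit: on a miss, a whole transaction-sized block of consecutive nodes is brought in, so the repeated N1/N2/N3 loop of the tables hits the cache rather than the backing memory. The sketch below counts backing-memory read transactions under that policy; the node ids, trace, and counts are illustrative assumptions, not measured figures.

```python
def count_memory_reads(access_sequence, txn_size):
    """Count backing-memory read transactions for a node-access trace,
    assuming a miss loads a whole aligned block of txn_size nodes."""
    cached, reads = set(), 0
    for node in access_sequence:
        if node not in cached:
            reads += 1
            base = (node // txn_size) * txn_size
            cached.update(range(base, base + txn_size))  # whole block cached
    return reads

# Nodes N0..N3 as ids 0..3: the repeated N1 -> N2 -> N3 loop of the
# example walk touches only the block loaded on the very first miss.
trace = [0] + [1, 2, 3] * 6
assert count_memory_reads(trace, txn_size=4) == 1  # one transaction total
assert count_memory_reads(trace, txn_size=1) == 4  # one per distinct node
```

Raising the transaction size from one to four nodes collapses four slow-memory reads into one for this trace, which is the effect the hierarchical node transaction size is meant to exploit.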
As such, in addition to the packet pre-screening performed by the HFA processor 110, which may reduce the number of false positives presented to the NFA processing performed by the HNA processing core 408, embodiments disclosed herein may further optimize match performance by caching nodes in the course of walking the nodes of per-pattern NFAs whose nodes are distributed across a plurality of memories of a memory hierarchy based on node locality within the corresponding per-pattern NFA. As disclosed above, embodiments disclosed herein may advantageously distribute the nodes of each per-pattern NFA across the plurality of memories of the memory hierarchy based on the understanding that the longer the rule (i.e., pattern), the less likely the nodes generated from portions at the end of the rule (i.e., pattern) are to be accessed (i.e., walked or traversed). Further, according to embodiments disclosed herein, the size of the node cache may advantageously be determined based on a maximum transaction size granularity of the plurality of memories, so as to further optimize match performance by reducing the number of accesses to slower-performance memories. In addition, the embodiments disclosed herein regarding the hierarchical node transaction size further optimize match performance by enabling efficient use of the limited number of entries in the node cache and by enabling the total number of nodes cached to be determined based on the size of a given transaction (i.e., read access) associated with a level.
FIG. 14 is a block diagram of an example embodiment of the internal structure of a computer 1400 in which various embodiments disclosed herein may be implemented. The computer 1400 contains a system bus 1402, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. The system bus 1402 is essentially a shared conduit that couples the different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, network ports, etc.) and enables the transfer of information among those elements. Operatively coupled to the system bus 1402 is an I/O device interface 1404 for coupling various input and output devices (e.g., keyboard, mouse, display, printer, speakers, etc.) to the computer 1400. A network interface 1406 allows the computer 1400 to couple to various other devices attached to a network. Memory 1408 provides volatile storage for computer software instructions 1410 and data 1412 that may be used to implement embodiments disclosed herein. Disk storage 1414 provides non-volatile storage for computer software instructions 1410 and data 1412 that may be used to implement embodiments disclosed herein. A central processor unit 1418 is also operatively coupled to the system bus 1402 and provides for the execution of computer instructions.
Further embodiments disclosed herein may be configured using a computer program product; for example, controls may be programmed in software for implementing example embodiments disclosed herein. Further embodiments disclosed herein may include a non-transitory computer-readable medium containing instructions that may be executed by a processor and that, when executed, cause the processor to complete methods disclosed herein. It should be understood that elements of the block diagrams and flow diagrams described herein may be implemented in software, hardware, firmware, or other similar implementations determined in the future. In addition, elements of the block diagrams and flow diagrams described herein may be combined or divided in any manner in software, hardware, or firmware.
It should be understood that the term "herein" may carry over to an application or patent that incorporates the teachings presented herein, such that the subject matter, definitions, or data carry forward into that incorporating application or patent.
If implemented in software, the software may be written in any language that can support the example embodiments disclosed herein. The software may be stored in any form of computer-readable medium, such as random access memory (RAM), read-only memory (ROM), compact disc read-only memory (CD-ROM), and so forth. In operation, a general-purpose or application-specific processor loads and executes the software in a manner well understood in the art. It should further be understood that the block diagrams and flow diagrams may include more or fewer elements, be arranged or oriented differently, or be represented differently. It should be understood that an implementation may dictate the block diagrams, flow diagrams, and/or network diagrams and the number of block diagrams and flow diagrams illustrating the execution of embodiments of the invention.
While the invention has been particularly shown and described with reference to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention encompassed by the appended claims.
Claims (45)
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201361872622P | 2013-08-30 | 2013-08-30 | |
| US201361872612P | 2013-08-30 | 2013-08-30 | |
| US61/872,622 | 2013-08-30 | | |
| US61/872,612 | 2013-08-30 | | |
| US14/325,841 US9785403B2 (en) | 2013-08-30 | 2014-07-08 | Engine architecture for processing finite automata |
| US14/325,841 | 2014-07-08 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1207179A1 (en) | 2016-01-22 |
| HK1207179B (en) | 2020-02-28 |