HK1193278A - Compiler for regular expressions - Google Patents
Compiler for regular expressions Download PDFInfo
- Publication number
- HK1193278A HK1193278A HK14106732.4A HK14106732A HK1193278A HK 1193278 A HK1193278 A HK 1193278A HK 14106732 A HK14106732 A HK 14106732A HK 1193278 A HK1193278 A HK 1193278A
- Authority
- HK
- Hong Kong
- Prior art keywords
- nfa
- states
- dfa
- state
- hash value
- Prior art date
Links
Description
RELATED APPLICATIONS
This application is a continuation of and claims priority to U.S. application No.13/168,450 filed 24/6/2011. The entire teachings of the above application are incorporated herein by reference.
Background
The Open Systems Interconnection (OSI) reference model defines 7 network protocol layers (L1-L7) for communication over a transmission medium. The upper layers (L4-L7) represent peer-to-peer communications and the lower layers (L1-L3) represent local communications.
Networking application aware systems need to process, filter and switch the range of L3 to L7 network protocol layers, e.g., L7 network protocol layers such as hypertext transfer protocol (HTTP) and Simple Mail Transfer Protocol (SMTP), and L4 network protocol layers such as Transmission Control Protocol (TCP). In addition to handling the network protocol layers, networking application aware systems need to simultaneously ensure access and content based security for these protocols through the L4-L7 network protocol layers, including firewalls, Virtual Private Networks (VPNs), Secure Socket Layer (SSL), intrusion detection systems (TDS), internet protocol security (IPSec), line-speed anti-virus (AV) and anti-spam functions.
Network processors are available for high throughput L2 and L3 network protocol processing, i.e., performing packet processing to forward packets at line speed. Typically, a general purpose processor is used to process L4-L7 network protocols that require more intelligent processing. While general purpose processors may perform computationally intensive tasks, there is insufficient performance for processing data so that it can be forwarded at wire speed.
Content aware networking requires inspection of packet content at "line speed". The content may be analyzed to determine if a security breach or intrusion exists. A number of patterns and rules in the form of regular expressions are applied to ensure that all security breaches or intrusions are detected. Regular expressions are compact methods for describing the patterns of character strings. The simplest pattern matched by a regular expression is a single character or string of characters, e.g.,/c/or/cat/. Regular expressions also include operators and meta-characters with special meaning.
By using meta-characters, regular expressions can be used for more complex searches, such as "abc. That is, without limiting the number of characters between "abc" and "xyz", the character string "abc" is found, followed by the character string "xyz". Another example is the regular expression "abc.. xyz;", i.e., the string "abc" is found, followed by two characters, then "abc" and followed by the string "xyz" after an unlimited number of characters.
Intrusion Detection System (IDS) applications examine the contents of all individual packets flowing through the network and identify suspicious patterns that may indicate an attempt to break into or threaten the system. One example of a suspicious pattern may be a particular text string in a packet that follows 100 characters by another particular text string.
Content searching is typically performed using a search algorithm, such as Deterministic Finite Automata (DFA) or non-deterministic finite automata (NFA), to process regular expressions.
Disclosure of Invention
A method and corresponding apparatus relate to converting a non-deterministic finite automata (NFA) graph for a given set of patterns to a Deterministic Finite Automata (DFA) graph having a plurality of states. Each of the DFA states is mapped to one or more states of the NFA graph. A hash value of one or more states of the NFA graph mapped to each DFA state is computed. For a given pattern, the DFA state table correlates each of a plurality of DFA states to a hash value of one or more states of the NFA graph.
Valid characters of a given pattern associated with letters identified by the NFA graph and the DFA graph may be determined. Based on the determination, the method compresses the NFA graph and the DFA graph to identify a pattern consisting only of valid characters of a given pattern associated with letters identified by the NFA graph and the DFA graph.
Drawings
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the invention.
Fig. 1 is a block diagram illustrating a system in which a security device operates to protect a private network.
Fig. 2 is a block diagram of a security device that may be used by the present invention.
Fig. 3 is an NFA graph of an example NFA.
Fig. 4 is a DFA graph of an example DFA.
Fig. 5A-G are NFA and DFA graphs and tables illustrating the principles of graph explosion.
Fig. 6A-B are flow diagrams of methods of converting NFA graphs to DFA graphs.
Fig. 7 is a flow diagram of a method for determining epothilones termination for a state in an NFA graph.
Fig. 8A-C are flow diagrams of methods for converting NFA graphs to DFA graphs.
Fig. 9 is a flow diagram of a method for determining epothilones termination for a state in an NFA graph.
Fig. 9A-9B are flow diagrams of methods for converting NFA graphs to DFA graphs.
Fig. 9C is a flow diagram for determining epothilones termination for a set of NFA states.
Fig. 10 illustrates a pattern matching mechanism for searching, for example, patterns "her", "his", and "her" using the Aho-Corasik algorithm.
Fig. 11A illustrates a failure value of each state of the pattern matching mechanism.
Fig. 11B illustrates output function values of the state of the pattern matching mechanism.
FIG. 12 illustrates a state tree of, for example, anchored patterns "Help" and "Shell".
FIG. 12A illustrates fault values for each state of the state tree of FIG. 12.
FIG. 12B illustrates the output function values for states 14 and 19 of the state tree of FIG. 12.
Fig. 13A illustrates a failure value of each state of the state tree in fig. 12.
FIG. 13B illustrates the output function values for states 12, 14, 17, and 19 of the state tree of FIG. 12.
Detailed Description
Before describing in detail exemplary embodiments of the present invention, exemplary security applications in which embodiments may be implemented and exemplary processes using DFAs and NFAs are described immediately below to aid the reader in understanding the inventive features of the present invention.
Regular expression (Regex) processing is becoming common in many packet processing systems. Regex processing may be applied to legacy security systems (e.g., Intrusion Prevention Systems (IPS), firewalls, and Unified Threat Management (UTM) devices), newer security systems (e.g., anti-malware, anti-spyware, zero-day additive detection), emerging protocol/application identification systems for billing, quality of service (QoS), and network monitoring systems in wired/wireless networks.
The regular expression processing may be subdivided into two stages i) compiling the signature/pattern into a binary data structure, such as a DFA graph or NFA graph, and ii) processing the received packets according to the compiled graph.
The storage and performance tradeoff requires that both stages of regular expression processing occur. Compilers that are allocated large run-time memory footprint are able to compile patterns at higher speed and efficiency. Similarly, a larger graph or equivalent binary data structure for packet inspection may give better packet inspection performance relative to a compact graph.
In practice, however, it is desirable for a compiler to compile rules very quickly with as little memory as possible. One reason is to update the pattern in the fields of the network device (e.g., router, switch, UTM, etc.) while the network device is still running (e.g., inspecting/forwarding packets). Therefore, the rules need to be compiled using limited memory in the embedded router device. Since the rules/patterns are used to prevent attacks on the system or to stop traffic infected by viruses, it is desirable to apply the rules/patterns as early as possible in order to optimize the security of the system. Therefore, the compiler should be able to very quickly compile the rules into a binary data structure.
The general approach compiles the new pattern or signature into a graphic on a central server, which then transmits the compiled graphic to the router. The router then examines the arriving packet according to the received pattern by passing the packet through each pattern. An efficient compiler requires sufficient memory resources. If the compiler does not have enough resources, the compiler performance is slow. Thus, simple methods do not compile new patterns or features on the router, as the router typically does not have sufficient resources (i.e., Random Access Memory (RAM) and CPU computations).
Embodiments of the present invention compile new patterns/signatures into graphics on the router while maintaining the performance level of the central server compiler.
Fig. 1 is a block diagram illustrating a system 100 that includes a security device 110, a protected network 115, and a public network 105. Public network 105 may include an unsecured Wide Area Network (WAN) such as the internet, a wireless network, a local area network, or other type of network. The protected network 115 may include a secure computer network, such as a local area network of an office or data center. As shown, the local area network may be an enterprise network 120 including a plurality of workstations 125. A plurality of workstations 125 are operatively coupled to a database 130, an FTP (file transfer protocol) server 135 and an intranet server 140.
In the system 100, the security device 110 is connected to the public network 105 and the protected network 115 such that network traffic flowing from the public network 105 to the protected network 115 flows first to the security device 110. The security device 110 may be a stand-alone network device (e.g., a router), a component of another network device (e.g., a firewall device), a software module executing on a network device, or other configuration. In general, the security device examines network traffic from the public network 105 and determines whether the network traffic includes any computer security threats. Computer security threats are attempts to gain access to sensitive information, attempts to disrupt an organization's operations, or other types of attacks. Example computer security threats include computer viruses, spyware, rogue software (rootkits), attempts to guess passwords, phishing mails, requests associated with denial of service additions, and other types of attacks.
The computer security threat may be associated with one or more symbol patterns that identify the computer security threat but not harmless data. The symbol patterns associated with computer security threats are referred to herein as "threat signatures". For example, a particular virus may always be included in a sequence of instructions that, when executed, performs a malicious operation.
If the security device 110 determines that a given network traffic flow does not include any computer security threats, the security device 110 may pass the network traffic flow to the protected network 115. Otherwise, if the security device 110 determines that the flow includes one or more computer security threats, the security device 110 may drop the network traffic, record the network traffic, forward the traffic to a traffic analyzer for further analysis, and/or perform some other action with respect to the network traffic. In this manner, the security device 110 may prevent network traffic including computer security threats from reaching the protected network 115.
To detect a security threat associated with one or more symbol patterns, the security device 110 receives a given pattern or sequence of symbols from the secure data center 140 to be monitored in arriving data traffic from the public network 105. Once the security device receives the given patterns to be monitored, the security device creates a finite state machine for each given pattern to be monitored. The security device 110 passes the received data packet through a finite state machine to determine whether the arriving data packet includes a potential security threat.
Fig. 2 is a high-level block diagram of an exemplary security device 200 that may be used with the present invention. The security device includes a memory 210 coupled to a processor 225 via a memory bus 245, and a storage device 230 and a network interface 240 coupled to the processor via an input/output (I/O) bus 250. It should be noted that the security device may comprise other devices, such as a keyboard, a display unit, etc. The network interface 240 interfaces the security device with the secure network 115, the public network 105, and the secure data center 140 and enables data (e.g., packets) to be communicated between the security device and other nodes in the system 100. To this end, network interface 240 comprises conventional circuitry, including signal, electrical, and mechanical features, as well as switched circuitry, which is required to interface with the physical media of system 100 and the protocols running on that media.
Memory 210 is a computer-readable medium implemented as RAM, including RAM devices such as DRAM devices and/or flash memory devices. Memory 210 contains various software and data structures used by processor 225, including data structures that implement aspects of the present invention. In particular, memory 210 includes an operating system 215 and a pattern matching/compilation service 220. The operating system 215 functionally organizes the secure device 200 by invoking operations in support of software processes and services executing on the secure device 200, such as a pattern matching/compilation service 220. As will be described below, pattern matching/compilation service 220 includes computer-executable instructions for compiling finite state machine graphics and/or compiled graphics of arriving data packets from a given pattern.
Storage 230 is a conventional storage (e.g., a disk or, more likely, a DRAM) that includes a pattern matching Database (DB) 235, which pattern matching Database (DB) 235 is a data structure configured to hold various information for compiling a finite state machine from a given pattern. The information may include signature patterns, finite state machine graphs (e.g., DFA graphs and NFA graphs), epothilones termination (EC) cache tables, and DFA state hash tables.
Typically, content-aware application processes use Deterministic Finite Automata (DFA) or non-deterministic finite automata (NFA) to identify patterns in the content of received packets. Both DFA and NFA are finite state machines, i.e., computational models, each of which includes a set of states, a starting state, an input letter (set of all possible symbols), and a transfer function. The calculation starts in the start-up state and changes to a new state according to the transfer function.
Patterns are typically expressed using regular expressions that include basic elements, e.g., normal text characters such as a-Z, 0-9, and meta characters such as ^ a, ^ and |. The basic element of a regular expression is the symbol (single character) to be matched. These are combined with allow concatenation (+), substitution (|), and kringle asterisk (×) meta-characters. The meta-characters used for concatenation are used to create multiple character matching patterns from a single character (or sub-string), while the meta-characters used for replacement (|) are used to create a regular expression that can match any of two or more sub-strings. The meta character clinstar (—) allows pattern matching to any number, including the absence of preceding characters or strings. Combining different operators and single characters allows complex expressions to be built. For example, the expression (th (is | at) ×) will match the following string: th, this, that, thisis, thisat, thatis or thatat.
The character class structure [ say ] allows listing of the characters to be searched, e.g., gr [ ea ] y lookup grey and gray. Dashes indicate the range of characters, e.g., [ A-Z ]. A meta character ". matches any one character.
The input to the DFA or NFA state machine is typically a string of (8-bit) bytes, i.e., the letter is a single byte (one character or symbol). Each byte in the input stream produces a transition from one state to another.
The states and transition functions of a DFA or NFA state machine can be explained by a graph, where each node in the graph represents a state and the arcs in the graph represent state transitions. The current state of the state machine is represented by a node identifier that selects a particular graph node.
Processing the regular expression using DFA and finding one or more patterns in the input stream of characters described by the regular expression is characterized by:
1) determined run-time performance: the next state of the DFA may be determined from the input character (or symbol) and the current state of the DFA. In other words, there is only one state transition per DFA state. In this way, the runtime performance of the DFA is considered deterministic and the behavior can be fully predicted from the inputs.
2) The smaller per-flow context (e.g., state or node pointer) needed to support matching across multiple packets: in a pattern that searches for inputs across several packets that make up a stream, the search may stop at one packet and then resume at another packet. Generally, determining the state to resume the search requires tracking, remembering, or otherwise storing (e.g., as a state pointer) all states experienced until the search stopped. However, in DFA, in order to resume the search, it is only necessary to record the state when the search was stopped. Thus, it can be said that DFAs are characterized require a small per-flow context to support pattern matching across multiple incoming packets, e.g., storing state or node pointers on the order of a few bytes.
3) The number of nodes (or pattern size) is exponentially increasing with the size of the pattern.
In contrast, using NFA to process a regular expression and find a pattern described by the regular expression in an input stream of characters is characterized by:
1) non-deterministic run-time performance: given the input character (or symbol) and the current state of the NFA, there may be more than one next state of the NFA to transition to. In other words, the next state of the NFA cannot be determined from the input and current state of the NFA. As such, the runtime performance of the NFA is considered uncertain and behavior cannot be completely predicted from the inputs.
2) The larger per-flow context (e.g., state or node pointer) needed to support matching across multiple packets: as previously described, patterns match across multiple incoming packets, where the search stops at one packet and then resumes at another packet, all states that are experienced until the packet stops need to be tracked. In NFA, the more inputs are matched, the more states are experienced and the number of states that need to be tracked. Thus, it can be said that NFA is characterized as requiring a larger per-flow context to support pattern matching across multiple input packets compared to DFA.
3) The number of nodes (or pattern size) is a linearly increasing pattern with the size of the pattern.
The features of the DFA and NFA described above are discussed further with reference to FIGS. 3, 4, and 5-G. It should be noted that for simplicity, for all DFA graphs shown in the figures, no arcs (state transitions) to node (state) 0 are shown, and no arcs to the same node as the node pointed to by node 0 are shown for the same characters.
FIG. 3 shows an NFA graph 300 of an example NFA for search patterns "cavium. network," nitrox [. Lambda \ r \ n \ t \ v \ s ] {3} octaon, "and" purevu. {5,10} videohips. Fig. 4 illustrates a DFA graph 400 of an example DFA used to search for the same set of patterns. As noted above, it should be noted that the DFA graph 400 and other DFA graphs provided herein are "simplified" for purposes of drawing. The arc to node 0 representing the state transition to DFA state 0 is not shown in the figure. Arcs to the same nodes as the node pointed to by node 0 for the same characters are also not shown in the figure.
For the same set of patterns, the NFA graph 300 of FIG. 3 has 69 nodes, representing 69 states, while the DFA graph 400 of FIG. 4 has 931 nodes (only a portion of which is shown in FIG. 4), representing 931 states. As shown, for a given pattern or patterns, the number of DFA states may be greater than the number of NFA states, typically by on the order of hundreds or thousands more states. This is an example of a "graph explosion," which is a hallmark feature of DFAs.
To further describe the concept of "pattern explosion," consider FIGS. 5A, 5B, and 5C, which show NFA patterns for patterns ". Lambda [. Lambda \ n ]", ". Lambda [. Lambda \ n ]", and FIGS. 5D, 5E, and 5F, which show DFA patterns for the same patterns, respectively. As shown in fig. 5A-5F and summarized by the table of fig. 5G, for some patterns, NFA may grow linearly, while DFA may grow exponentially to produce a pattern explosion.
Returning to fig. 3, the NFA represented by graph 300 is used to search for patterns in the input stream "purevucheps arevidieo chips", NFA processing or matching begins at the start states 0, 2, 19 and 36 of the NFA, represented by nodes 305a-d and abbreviated as NFA START STATES = {0, 2, 19, 36 }. With respect to the character "p" of the input stream, NFA transitions to state 37 (represented by node 310) and tracks states 0, 2, and 19 (abbreviated as pairs "p" = {0, 2, 19, 37 }), and proceeds as follows:
for "u" = {0, 2, 19, 38}
For "r" = {0, 2, 19, 39}
For "e" = {0, 2, 19, 40}
For "v" = {0, 2, 19, 41}
For "u" = {0, 2, 19, 42}
For "c" = {0, 2, 19, 44}
For "h" = {0, 2, 19, 45}
…
.., etc.,
using the DFA represented by the DFA graph 400 of fig. 4 to search for the same pattern in the same input, the DFA match begins at DFA start state 0, represented by node 405 and abbreviated as DFA START STATES = {0 }. For the character "p" of the input stream, the DFA transitions to state 3, represented by node 410 and abbreviated as pair "p" = {3}, and continues as follows:
for "u" = {6}
For "r" = {9}
For "e" = {12}
For "v" = {15}
For "u" = {18}
For "c" = {27}
For "h" = {41}
…
.., etc.
As shown in the above example, in the NFA, there are at least n +1 number of NFA states to track, where n is the number of patterns to search (e.g., in the case of 3 patterns to search, there are at least 4 states to track). In contrast, in DFA, there is only one state to track per input character. For the purpose of illustration, it is now assumed that an input stream (stream) or stream (flow) "purevuchis are provided across a plurality of packets, wherein a first packet ends with" h "of" purevuchis "and a second packet starts with" i "of" purevuchis ". In NFA, the search stops at "h" (the end of the first packet), with four states to track (i.e., states 0, 2, 19, and 45). To resume the search at "i" (start of second packet), these four states need to be remembered. In contrast, in DFA, the search stops at "h" (the end of the first packet), with one state being tracked (i.e., state 41). To resume the search at "i" (start of second packet), this one state needs to be remembered. This example shows that in the NFA, the per flow context needed to support matching across multiple packets is four states (e.g., by storing four state pointers), while in the DFA, the per flow context is one state. Thus, the per-flow context required by the NFA is larger than that required by the DFA of the same pattern. Similarly, DFA requires smaller per-flow context than the same pattern of NFA.
For each non-deterministic finite automaton, there is an equivalent deterministic finite automaton. The equivalence between the two is defined in language acceptance. Because the NFA is a finite automaton, where 0, 1, or multiple transitions are allowed for an input symbol, an equivalent DFA can be constructed to simulate all of the movements of the NFA in parallel with respect to a particular input symbol.
Since the DFA equivalent of an NFA models (paralleling) the movement of the NFA, each state of the DFA is a combination of one or more states of the NFA. Thus, each state of the DFA will be represented by some subset of the set of states of the NFA; and thus, the conversion from NFA to DFA is often referred to as "building" the subset. Thus, if a given NFA has n states, an equivalent DFA can have 2nA number of states, wherein the initial state corresponds to the subset q0}. Thus, the transition from the NFA to the DFA involves finding all possible subsets of the set of states of the NFA, considering that each subset is a state of the DFA, and then finding a transition from that state for each input symbol.
As described above, because of the multiple possible transitions of the NFA, it is difficult for the NFA graph to be processed by the computer system, and thus a transition of the NFA to the DFA occurs.
Fig. 6A-B are flow diagrams of a method 600 for converting NFA graphs to DFA graphs. The method 600 begins at 605. At this stage, the DFA state set "Sd" is empty. At 610, the starting state of the DFA is determined and added to the DFA state set "Sd" as an unmarked state. The start state of the DFA is determined to be the epothilones termination of the start state of the NFA graph. The method of determining Epsilon termination for a set of NFA states is further described below with reference to FIG. 7.
At 615, it is determined whether the set of DFA states "Sd" includes an unmarked DFA state. If an unlabeled DFA state of the set of DFA states "Sd" exists, then at 620, the unlabeled state "S" is selected and labeled. At 625, the letter (e.g., word) of the language "A" identified by the NFA graphic is selected. At step 630, NFA state "S" of DFA state "S" is selected. Further, prior to step 630, the data structure "St" used to maintain the NFA state set is set to "NULL". At 635, the transfer function "TTn = (s, a)" is applied to the NFA state "s" using the letter "a". If an input of "a" is received, the transfer function determines all NFA states that are reached from NFA state "s". The determined state of the NFA is then added to the data structure "St". At 644, it is determined whether the DFA state "s" includes other NFA states. If so, the method repeats steps 630 and 635 until all NFA states "s" of DFA states "s" have been processed. If all NFA states have been processed, the method continues at step 640. At 640, the epothilones terminations for all NFA states "s" in the data structure "St" are determined and added to the data structure "St".
At step 645, the data structure "St" is compared to all existing DFA states "S" to determine whether the DFA state "S" already includes all NFA states "S" in the data structure "St". The current approach stores each NFA state "S" associated with each DFA state "S" in the data structure. To determine whether an NFA state in data structure "St" has been associated with DFA state "S," each NFA state "S" of data structure "St" must be compared to each NFA state "S" of each DFA state "S". Therefore, such comparison requires a lot of time and memory.
Table 1 below illustrates an example DFA state table that associates DFA state numbers with NFA state sets. The set of NFA states may be stored in a data structure for each DFA state number, as described above.
TABLE 1
| Number of DFA states | NFA State set |
| 0 | {0,1,2,3,4} |
| 1 | {0,5,6,7} |
| 2 | {8,9,2,3} |
| … | … |
| … | … |
For example, the runtime of the operation at step 645 is obtained below with reference to table 2, according to the implementation of the data structure (containing the DFA states and their corresponding set of NFA states). Table 2 lists the storage and maintenance costs of an example data structure implementation. The comment column of table 2 provides a description of each example data structure. For each data structure, it is assumed that there are "N" DFA states, and further that each DFA state represents, on average, 'M' NFA states.
TABLE 2
Continuing with fig. 6A-B, if the DFA state "s" already includes all of the NFA states "s" of the data structure "St," then the method moves to step 695. At 690, it is determined whether any of the NFA states "s" belong to the final state set of the NFA graph. If so, the new DFA state is determined to be the final acceptance state of the NFA graph. At 695, the transition from the flagged DFA state "S" with input "S" is set to the new DFA state "D".
Fig. 7 is a flow chart of a method 700 for determining an epothilones termination for any given NFA state "s" of an NFA graph. Method 700 begins at step 705. At 710, the Epsilon termination function receives NFA state "s" for processing. At 715, state "s" of the NFA is not tagged and is added to the set of untagged NFA states "s". At 720, NFA state "S" is selected from the unlabeled NFA states "S". At 725, the Epsilon transition from the selected NFA state "s" is determined. At step 730, it is determined whether there are any Epsilon transitions. If not, then at 740, the selected NFA state's' is added to the Epsilon termination set. If there are transitions, then at step 745, all determined NFA transitions from the selected NFA state 'S' are added to the set "S" as unmarked NFA states. Step 750 determines whether there are any unmarked NFA states in NFA states "s". If so, the method continues from step 720. If not, the method ends at step 755.
Example pseudo code #1 for the above referenced method (beginning at step 615 of FIG. 6A) is as follows:
1. for each unlabeled DFA State "d" (column-1-in the DFA State Table, Table 1 above)
1. For each letter "a" in the alphabet set "
1. Set S = { }
2. For each NFA State "n" of "d" (column-2-in the DFA State Table, Table 1 above)
1. If "n" has an outgoing arc to "m" arc for "a
1.S=S U{m}
3.SE=ECLOSURE(S)
4. The findings were designated as "false"
5. For each DFA state in the DFA state table, Table 1 above
1. Let "p" be the set of NFA states corresponding to DFA state "f
2. If the sets "Se" and "p" are equal
1. The findings were designated as "true"
2. To 1.6
6. If found to be "true"
1. Setting switch ("d", "a") = "f"
7. Others
1. Add New DFA State "f" to the DFA State Table, Table 1 with "Se" as the set of NFA states in COL.2
2. Setting switch ("d", "a") = "f"
2. Setting "d" to "marked"
The disadvantages of the above referenced method are as follows: i) step 1.1.3 of the pseudocode, representing methods 700 and 600, always calculates the Epsilon termination (ECLOSURE ()), since no history of Epsilon termination is stored in memory, ii) step 1.1.5.2 is very time consuming due to the set equivalence test. The time consumed by the set equivalence test depends on the number of elements in the set to be compared (i.e., the number of NFA states in the DFA states (as shown in Table 1); and iii) the entries in Table 1 cannot be deleted because the entries are needed to perform step 1.1.5.2 of the set equivalence test, thereby requiring a significant amount of memory resources.
In an example embodiment of the present invention, the NFA graph is converted to an equivalent DFA graph.
Fig. 8A-C are flow diagrams of a method 800 for converting NFA graphics to NFA graphics in accordance with an example embodiment of the present invention. The method 800 begins at 805. At this stage, the set of DFA states "Sd" is empty. At 810, the starting state of the DFA is determined and added to the set of DFA states "Sd" as an unmarked state. The start state of the DFA is determined to be the epothilones termination of the start state of the NFA graph. The method of determining the epothilones termination for NFA status according to an example embodiment of the present invention is further described below with reference to fig. 9. At 815, the encrypted/perfect hash value of the NFA state associated with the DFA start state is computed and stored in a table that correlates the DFA state with the hash value of the NFA state associated with the DFA state, as shown in table 4 below. At 820, the epothilones termination for the NFA start state is stored in the epothilones cache as further shown in table 5. The epothilones cache is retrieved by a hash of the set of input NFA states, and the stored data is the epothilones termination of the computed input NFA states.
The cryptographic hash/perfect hash function is a deterministic process that requires taking an arbitrary block of data and returning a fixed-size bit string, which is a cryptographic hash value. Example cryptographic hash functions include, for example, a message digest algorithm (MD 5) or a secure hash algorithm (SHA 1/SHA 2). With a larger digest (e.g., 128b for MD 5), the chance of collision is less likely. However, the "integrity check" may be done offline to verify that there are no conflicts (different data sets with the same hash value) so that if a conflict occurs, the graph may be corrected.
At 825, it is determined whether the set of DFA states "Sd" includes unmarked DFA states. If not, the method ends at step 895. If an unmarked DFA state of the set of DFA states "Sd" exists, then at 830, unmarked state "S" is selected and marked. At 835, the letter (e.g., word) of the language "a" identified by the NFA graph is selected. Further, the data structure "St" for holding the NFA state set is set to "empty". At step 840, the NFA state "S" associated with the DFA state "S" is selected. At 850, using the letter "a" input value, the conversion function "TTn = (s, a)" is applied to the NFA state "s". If an input of "a" is received, the transfer function determines all NFA states that are reached from state "s" of the NFA. The determined state of the NFA is then stored in data structure "St" at 855. At 860, it is determined whether the DFA state "S" includes additional associated NFA states. If so, the method repeats at steps 850 and 855 until all NFA state "S" of the DFA state "S" have been processed. If all NFA states have been processed, the method continues at step 865. At 865, the epothilones terminations for all NFA states "s" in data structure "St" are determined as per fig. 9 and added to data structure "Se".
At step 870, the data structure "Se" is compared to all DFA states "S" present to determine whether the DFA state "S" includes all NFA states "S" in the data structure "Se". As described above with reference to step 645 of method 600, the general method stores the set of NFA states "S" associated with each DFA state "S" in a data structure. To determine whether NFA state "S" in the data structure "Se" is already associated with DFA state "S", each set of NFA states "S" of the data structure "Se" must be compared to each set of NFA states "S" for each DFA state "S". Therefore, such comparison requires a lot of time and memory, as shown in table 2.
In this embodiment, the encryption/perfect hash value of the NFA state in the data structure "Se" is calculated and then compared to a table that correlates the DFA state number with the hash value of its corresponding set of one or more NFA states. If a matching hash value exists, then at step 870, a determination is made as to whether a DFA state associated with the NFA state in the data structure "Se" already exists, and the method moves to step 890. At 890, the transition from DFA state "S" for an input with the letter "a" is set to the existing DFA state associated with the matching hash value. The method moves to step 845 where a determination is made as to whether another letter "a" is present in the language 'a', and if so, the method repeats from step 835. If not, the method moves to step 847. At step 847, the method deletes the set of NFA state numbers and adds the set of states as flagged. The method then continues at step 825.
The runtime of the operation at step 870 is obtained below with reference to table 3. Table 3 lists the storage and maintenance costs of hash matching according to an example embodiment of the present invention. The comment column of table 3 provides a description of the hash match.
TABLE 3
Table 4, shown below, is a DFA table that correlates DFA state numbers with hash values of its corresponding set of one or more NFA states, as described above.
TABLE 4
| Number of DFA states | DFA state hashing | NFA State number set | Mark (M)/unlabeled (U) |
| 0 | 81237912891273 | Deleting | M |
| 1 | 09237504823405 | Deleting | M |
| 2 | 23894729379237 | {4,5,6,2,0,1} | U |
| 3 | 89345798731278 | {4,2,3,7,1,8} | U |
| … | … | … |
Continuing with fig. 8, if there is no matching hash value at step 870, the method moves to step 875. At 875, the NFA state "s" in the data structure "Se" is added to the DFA state set "Sd" as a new unmarked DFA state (e.g., "D"). In addition, the encrypted/perfect hash value for NFA state "s" is computed and the new DFA state is mapped to the hash value in table 4 above. At 880, it is determined whether any of the NFA states "s" belong to the final state set of the NFA graph. If so, the new DFA state is determined to be the final acceptance state of the NFA graph. At 885, a transition with input "a" from the flagged DFA state "S" is determined as a new DFA state "D".
Fig. 9 is a flow diagram of a method 900 for determining an epothilony termination for a set of states in an NFA graph in accordance with an example embodiment of the present invention. Method 900 begins at step 905. At 910, the Epsilon termination calculator receives a set of NFA states. At 915, the set of NFA states is compared to entries within the cache to determine whether the cache contains a matching set of NFA states.
Each entry within the cache may be stored as a hash value representing an epsilonlon terminated set of NFA states that map to an input set of NFA states. The calculator calculates a hash value from the received set of NFA states and determines whether the EC cache has a matching hash value entry associated with the set of NFA states.
As shown below, table 5 is an epsilone termination cache table that maps sets of NFA states to its epsilone termination, as described above.
TABLE 5
| EC input set hashing | EC output set |
| 78346782346782 | {3,4,1,2,7} |
| 89237489237492 | {8,3,2,5,19} |
| … | … |
Continuing with FIG. 9, if a match is found, then at 920 an Epsilon termination is returned. However, if no match is found, then the Epsilon termination for the set of NFA states received is calculated. The epothilones termination is calculated as described above with reference to method 700 described in fig. 7. Once the Epsilon termination is calculated, the Epsilon termination is stored as a new entry in the Epsilon cache at 930. At 935, a calculated Epsilon termination is returned.
Method 900 allows efficient processing of epothilones terminations by eliminating redundant processing. For example, method 900 calculates the Epsilon termination for the NFA state set only when the Epsilon termination has not been calculated. This eliminates the need to perform more than one epothilones termination process for one NFA set. Referring to the method 600 described above with reference to fig. 6A-B, the method can calculate the epsilonlon termination for any given NFA node more than once. However, by storing the previously calculated Epsilon termination in the cache, method 900 eliminates the need for unnecessary data processing.
The example pseudo code #2 of the method referenced above (beginning at step 825 of FIG. 8A) is as follows:
1. for each unlabeled DFA State "d" (column-1-in the DFA State Table, Table 4 above)
1. For each letter "a" in the alphabet set "
1. Set S = { }
2. For each NFA device "n" of "d" (column-3-in the DFA status Table, Table 4 above)
1. If "n" has an outgoing arc to "m" for "a" 1.S = S U { m }
3. Obtain ECLOSURE "Se" for the set "S", as given below
1. Computing the hash "Hi" of the set "S"
2. For each entry "e" in the EC cache table, Table 5 above
1. Let "He" be the hash value at entry "e" (column-1-in the EC cache table above)
2. If "Hi" and "He" are the same
Se = EC output set (e), i.e. column-2-in the above-mentioned EC cache table
2. Proceed to 1.1.4
3.Se=ECLOSURE(S)
4. Adding new entries to the EC cache table with fields "Hi" and "Se
4. The findings were designated as "false"
5. Computing a hash "q" of the set "Se"
6. For each DFA state "f" in the DFA state table, Table 4 above
1. Such that "p" is the hash of the NFA state of DFA state "f
2. If "p" and "q" are the same
1. The findings were designated as "true"
2. Proceed to 1.1.7
7. If found to be "true"
1. Setting switch ("d", "a") = "f"
8. Others
1. Add New DFA State "f" to the DFA State Table, Table 4 with fields "q" and "Se
2. Setting switch ("d", "a") = "f"
2. The set of NFA state numbers for DFA state "d" is deleted from the DFA state table, table 4, and "d" is set to marked.
The advantages of the above referenced method are as follows: i) step 1.1.3 avoids calculating ECLOSURE (), if already calculated, ii) step 1.1.6.2 is a comparison of hash values, whose size is constant and takes a fixed amount of time for comparison, better than the set equivalence test. The amount of time is O (1), as shown in Table 3 above; and iii) because the set of NFA states for the DFA states is deleted after processing, a significant amount of compiler memory footprint is saved, as shown in Table 4.
Another optimization is that the EC _ CACHE can also store the direct DFA state number, rather than the NFA state set corresponding to the epothilones termination set (of NFA states), so that step 870 is not required at all. For example, if there is a hit in EC _ CACHE (), there is no need to search for an equivalent DFA node. As shown below, table 6 is an example EC _ CACHE table that stores the direct DFA state number corresponding to the epothilones termination set (of NFA states).
TABLE 6
| EC input set hashing | Number of DFA states |
| 78346782346782 | 13 |
| 89237489237492 | 14 |
| … |
Thus, the processing table of step 870 becomes:
TABLE 7
Fig. 9A-B are flow diagrams of a method 901 for converting NFA graphs to DFA graphs, in accordance with an example embodiment of the present invention. Method 901 begins at 902. At this stage, the DFA state set "Sd" is empty. At 903, the starting state of the DFA is determined and added to the DFA state set "Sd" as an unmarked state. The start state of the DFA is determined to be the epothilones termination of the start state of the NFA graph. The method of determining Epsilon termination for NFA status according to an example embodiment of the present invention is further described below with reference to FIG. 9C. At 904, the encrypted/perfect hash value of the NFA state associated with the DFA start state is computed, and at 906, the mapping of the DFA start state and the encrypted hash is stored in a table that correlates the DFA state with the hash value of the NFA state associated with the DFA state, as shown in table 6 above.
At 907, it is determined whether the set of DFA states "Sd" includes an unmarked DFA state. If not, the method ends at step 908. If an unlabeled DFA state of the set of DFA states "Sd" exists, then at 909, the unlabeled state "d" is selected and labeled. At 911, the letters (e.g., letters) of the language 'A' identified by the NFA graphic are selected. Further, the "S" for maintaining the data structure of the NFA state set is set to "empty". At step 913, the NFA state 'n' associated with DFA state "d" is selected. At 914, the conversion function "TTn = (s, a)" is applied to the NFA state "n," inputting a value using the letter "a". If an input of "a" is received, the transfer function determines all NFA states that are reached from NFA state "n". The determined NFA state is then stored in data structure "S" 916. At 917, it is determined whether DFA state'd' includes additional associated NFA states. If so, the method repeats at step 913 until all NFA states 'n' of DFA state "d" have been processed. If all NFA states have been processed, the method continues at step 918. At 918, transitions from DFA state "d" according to the letter "a" are determined as per FIG. 9C. At step 919, transition state "f" from DFA state "d" with an input of the letter "a" is set and stored in the DFA state table. At 921, the stored set of NFA transition states is deleted from the data structure "S".
Fig. 9C is a flow diagram of a method 922 for determining an epothilony termination for a set of states in an NFA graph, according to an example embodiment of the invention. The method 922 begins at step 923. At 924, the epothilony termination calculator receives the set of NFA states and calculates a hash value "Hi" for the received set of NFA states. At 926, the hash value "Hi" is compared to entries in the Epsilon cache to determine if the cache contains a match.
Each entry in the cache may be stored as a hash value that represents a set of NFA states that map to DFA states. The calculator calculates a hash value from the received NFA state set and determines whether the EC cache has a matching hash value entry associated with an NFA state set that is related to the DFA state.
If a match is found, then at 933, the DFA state "f" is returned that maps to the matching hash value of the cache table. However, if no match is found, then at step 928, the Epsilon termination for the received set of NFA states is computed. The epothilones termination may be calculated as described above with respect to method 700 described above with reference to fig. 7. Once the epothilones termination is computed, at 929 an epothilones terminated cryptographic hash is computed and stored as a new entry in the epothilones cache. At 931, the new DFA state "f" corresponding to the hash value of the NFA state set is mapped to the hash value in the EC cache table. At 932, the new DFA state "f" is returned.
Example pseudo code #3 for the above referenced method is as follows:
1. for each unlabeled DFA State "d" (column-1-in the DFA State Table, Table 4 above)
1. For each letter "a" in the alphabet set "
1. Set S = { }
2. For each NFA State "n" of "d" (column-3-in the DFA State Table, Table 4 above)
1. If "n" has an outgoing arc to "m" for "a
1.S=S U{M}
3. Transition DFA State "f" with respect to letter "a" to obtain DFA State "d" is as follows
1. Computing the hash "Hi" of the set "S"
2. For each entry "e" in the EC cache table, table 6 above,
1. let "He" be the hash value at entry "e" (column-1-in EC cache Table 6 above)
2. If "Hi" and "He" are the same
1. Assign "f" to the DFA state number of entry "e", i.e., column-2-in the EC cache table above
2. Proceed to 1.1.5
3.Se=ECLOSURE(S)
4. Computing a hash "q" of the set "Se"
5. The findings were designated as "false"
6. For each DFA State "g" in the DFA State Table above "
1. Hash of NFA state such that "p" is DFA state "g
2. If "p" and "q" are the same
1. Assign the discovery as "true" and assign "g" as "f"
2. Proceed to 1.1.3.7
7. If found to be "true"
1. Proceed to 1.1.4
8. Adding a new unmarked DFA State "f" to the DFA State Table Using "q" and "Se
4. Adding a new entry in the EC EACHE table using the hash "Hi" and DFA state number "f
5. Setting switch ("d", "a") = "f"
2. The set of NFA states number for DFA state "d" is deleted from the DFA state table, table 4 above, and the state is set to "flagged".
The size of EC _ CACHE is configurable and limited according to the allowed run-time memory footprint. If run-time memory usage is limited, a replacement policy is required. For example, a replacement policy may preserve the least recently used Epsilon termination (EC) or no replacement at all of the NFA state set. In the latter case, the EC _ CACHE holds only a fixed number of predetermined ECs of the NFA state set. The latter case has been found to be very useful.
The advantages of the above referenced method are as follows: i) if it has already been computed, step 1.1.3.2 avoids computing ECLOSURE (), ii) if possible (step 1.1.6.2 in the previous Algorithm, pseudocode # 2), storing the number of DFA nodes in the EC cache table instead of the ECLOSURE set avoids searching the DFA nodes given its ECLOSURE (); and iii) because the set of NFA states for DFA states is deleted after processing in table 4, a large amount of memory usage by the compiler is saved.
As described above, content searches are typically performed using search algorithms, such as Deterministic Finite Automata (DFA) or non-deterministic finite automata (NFA), to process regular expressions. Another type of string search algorithm that may be implemented is the Aho-Corasik algorithm.
The Aho-coresik algorithm can be used to create finite state machine graphs from a collection of text strings that can be used to process the input payloads in a single pass. For example, given a set of strings, the Aho-Corasik algorithm creates a finite pattern matching machine for processing any arbitrary string of text that may be received via packets in a communication network. The behavior of the created pattern matching machine is controlled by three functions: i) a go-to function "g", ii) a fault function "f", and iii) an output function "output".
Fig. 10 shows a pattern matching machine 1000 for searching for patterns "hers", "his", and "she" using the Aho-coresik algorithm. The go function 'g' maps a pair consisting of a state and an input symbol to a state or message "failure". The failure function "f" maps states to states and queries whenever going to the function "g" reporting "failure". The output function "output" associates a set of keys (which may be null) with each state.
The start state is state 0 (represented by node 1005). In any given state, if the go-to function "g (s, a) = t" ("s" is the current state of the finite machine, "a" is the input value, and "t" is the transition state), the pattern matching machine 1000 enters the state "t" and the exact next symbol to be input becomes the current input symbol. For example, referring to fig. 10, if in state 0 (node 1005) and an's' input value is received, the pattern matching machine 1000 will transition to state 3 (node 1015).
However, if going to the function "g (s, a) = fault" and the fault function "f(s) =" s "", the pattern matching machine repeats the loop with "s" as the current state and the input letter "a" as the current input symbol. Fig. 11A shows a related art failure value for each state of the pattern matching machine 1000. Fig. 11B shows the output function values for states 2, 5, 7 and 9.
For example, referring to FIG. 10, an arbitrary input string "ushers" is processed by the handler as follows:
processing or matching begins in start state 0 (represented by node 1005). For the character "u" of the input stream, the handler remains in state 0 (node 1005). For the character "s" of the input stream, the processor transitions to state 3 (node 1015) and continues as follows:
for "h" = {4}
For "e" = {5}
For "r" = {8}
For "s" = {9}
In state 4 (node 1020), because the go-to function g (4, "e") =5, and the handler 1000 enters state 5, the keywords "she" and "he" are matched at the end of position 4 in the text string "ushers", and the output function issues output (5) (as seen in fig. 11B). For the input symbol "r" in state 5, the machine 1000 makes two state transitions. Because g (5, r) = fault, the machine 1000 enters state 2= f (5). Further, since g (2, r) =8, the machine 1000 enters state 8 and proceeds to the next input symbol.
Having described an example security application in which example embodiments of the present invention may be implemented and a typical process using the Aho-Corasik machine 1000, example embodiments of the present invention are now described in detail.
As described above, the Aho-Corasik machine 1000 detects the presence of a keyword or pattern at each position of an arbitrary input string. In some cases, it may make sense that a given pattern or keyword is only found in a particular region or location of the input string. For example, the HTTP protocol request parser is only interested in the keyword "Get" when it occurs at the beginning of the request, and is not interested in any other "Get" thereafter. Such a pattern may be referred to as an anchor pattern.
Embodiments of the present invention enable the creation of an Aho-Corasik pattern matching machine that recognizes both non-anchor and anchor patterns. As described above, the pattern matching machine 1000 uses the Aho-coresik algorithm to identify the unanchored patterns "hers", "his", and "she". By modifying the Aho-coresik algorithm, a pattern matching machine can be created to recognize the additional anchor patterns "hellp" and "shell".
Given a pattern set, anchored patterns must be distinguished from non-anchored patterns. The anchored pattern may have additional macros that specify that the pattern is anchored. For example, "{ 0 }" may be added to the beginning of a pattern to specify that the pattern is an anchor pattern. Thus, given the pattern list "he, she, his, hers, {0} help, and {0} shell," the compiler can identify the keys "help" and "shell" as anchor patterns.
Once the compiler receives the list of keys/patterns, the compiler can also distinguish the non-anchor patterns from the anchor patterns. The compiler then creates independent state trees for all anchor patterns and independent state trees for non-anchor patterns using the go to function 'g' as described above (machine 1000 shown in FIG. 10). Fig. 12 illustrates a state tree 1200 for the anchor patterns "hellp" and "shell". Fig. 12A illustrates fault values for each state of the state tree 1200 (i.e., assuming that the patterns are compiled as non-anchor patterns) in accordance with the prior art. Fig. 12B illustrates the output function values of state 14 and state 19 of the state tree 1200.
Once the state trees for the anchor pattern and the non-anchor pattern are created, the compiler calculates a fault function "f" for both state numbers. For state trees representing non-anchor patterns, the compiler implements a fault function as described above with reference to FIGS. 10 and 11A. According to an exemplary embodiment of the present invention, the failure function of the anchor pattern is constructed according to the following rules:
a) the failure of the root node of the anchored tree is set equal to the root node of the non-anchored state tree. Thus, if no anchor pattern is matched, the non-anchor pattern is tracked. For example, the failure "f" of state 10 (node 1205) is set equal to the start state 0 (node 1005 in fig. 10).
b) Once the fault for the starting state of the anchor tree is determined, the fault "f" for each state of the anchor tree is determined so that partial matches of non-anchor keys to anchor keys are also tracked using the go-to function "g", as shown in FIG. 13A.
The output function of the anchor pattern is computed separately, but appears to remain overlapping with the non-anchor pattern, as shown in FIG. 13B.
Thereafter, according to an example embodiment of the present invention, the root node of the anchored state tree is set to the root node of the final state tree (a combination of anchored and non-anchored state trees).
Now, the anchored state tree and the non-anchored state tree have been effectively merged into a single state tree.
For example, referring to FIGS. 10, 11A-B, 12 and 13A-B, an arbitrarily input string "ushers" is processed by the handler as follows:
processing or matching begins at start state 10 (represented by node 1205). For the character "u" of the input stream, the machine transitions to state 0 (node 1005 in FIG. 10) according to the fault function shown in FIG. 13A, and "u" is processed again at node 0 (1005), and the machine stays at node "0" (1005). For the character "s" of the input stream, the machine transitions to state 3 (node 1015) and continues as follows:
for "h" = {4}
For "e" = {5}
For "r" = {8}
For "s" = {9}
In state 4 (node 1020), because to function g (4, "e") =5, and the machine 1000 enters state 5, the keywords "she" and "he" are matched at the end of position 4 in the text string "ushers", and the output function issues output (5) (as seen in fig. 11B). For the input symbol "r" in state 5, the machine 1000 transitions to state 8. For the input symbol "s" at state 8, the machine 1000 transitions to state 9, the key "hers" is matched and the output function issues an output (9) (as seen in fig. 11B).
In another example, referring to FIGS. 10, 11A-B, 12 and 13A-B, an arbitrary input string "shell" is processed by the processor as follows:
processing or matching begins at start state 10 (represented by node 1205).
For the character "s" of the input stream, the machine transition to state 15 (FIG. 12) continues as follows:
for "h" = {16}
For "e" = {17}
For "l" = {18}
For "l" = {19}
In state 16 (fig. 12), because the go to function g (16, 'e') =17, and the machine 1200 enters state 17, the keywords "she" and "he" are matched at the end of position three of the text string "shell", and the output function issues an output (17) (as seen in fig. 13B). For the input symbol "l" in state 17, the machine 1200 transitions to state 18. For the input symbol "l" at state 18, the machine 1200 transitions to state 19, the key "shell" is matched as an anchor pattern, and the output function issues an output (19) (as seen in FIG. 13B).
As mentioned above, the input to a DFA or NFA state machine is typically a string of (8-bit) bytes, i.e., the letter is a single byte (one character or symbol). Thus, the size of the entire letter set may be 256. In addition, each byte in the input stream produces a transition from one state to another. However, there are not how many patterns, strings, or regular expressions use the entire set of letters. Most patterns use a small subset of the alphabet set, which may be referred to below simply as the "valid character" set. For example, only printable ASCII letters (e.g., including a-Z, A-Z, 0-9, and some symbols) may be used.
Embodiments of the present invention compress the NFA and NFA graphs to identify only "valid character" sets. According to an example embodiment of the invention, pseudo-codes #1, #2, and #3 only process letters in the "valid character" set during step 1.1 of each pseudo-code.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope and spirit of the present invention as encompassed by the following claims.
Claims (66)
1. A method, comprising:
in a processor of a security device coupled to a network:
converting a non-deterministic finite automaton (NFA) graph for a given set of patterns to a Deterministic Finite Automaton (DFA) graph having a plurality of states, the converting the NFA graph comprising:
mapping each of a plurality of DFA states to one or more states of the NFA graph;
computing a hash value of the one or more states of the NFA graph mapped to each DFA state;
storing a DFA state table that correlates each state of the plurality of DFA states to the hash value for the one or more states of the NFA graph for the given pattern.
2. The method of claim 1, wherein the computed hash value is an encrypted/perfect hash value.
3. The method of claim 1, wherein mapping comprises:
determining whether an unmarked DFA state exists within the plurality of DFA states;
selecting one of the unmarked DFA states;
flagging the unmarked DFA state; and
determining, for characters of letters identified by the NFA graph, transitions of the one or more states of the NFA graph to other NFA states that map to flagged DFA states.
4. The method of claim 3, wherein determining a transition comprises: mapping the other NFA states and the Epsilon terminations of the other NFA states to possible new unmarked DFA states.
5. The method of claim 4, wherein mapping possible new unmarked DFA states to Epsilon terminations for the other NFA states comprises: obtaining the EPON terminations for the other NFA states from an EPON cache, and calculating the EPON terminations for the other NFA states if the EPON terminations for the other NFA states are not present within the EPON cache.
6. The method of claim 5, further comprising: adding the possible new unmarked DFA state to the plurality of DFA states and storing the possible new unmarked DFA state in the DFA state table if the other NFA states mapped to the possible new unmarked DFA state and the Epsilon termination of the other NFA states are not mapped to existing DFA states of the plurality of DFA states.
7. The method of claim 6, wherein adding the possible new unmarked DFA states comprises:
comparing the Epsilon-terminated hash values of the other NFA states mapped to the possible new unmarked DFA states to hash values of NFA states mapped to each of the plurality of DFA states.
8. The method of claim 6, wherein adding the possible new unlabeled DFA states to the plurality of DFA states further comprises: determining whether the EPON termination of the other NFA states mapped to the possible new unmarked states belongs to a final accepted NFA state, and if so, adding a possible unmarked state as a final accepted DFA state.
9. The method of claim 6, further comprising: adding the possible new unmarked states as transitions from marked DFA states for the characters of the letter identified by the NFA graphic.
10. The method of claim 4, further comprising: replacing the mappings of the Epsilon terminations of the other NFA states and the other NFA states to the possible new unmarked DFA states with the calculated mappings of Epsilon termination hash values of the other NFA states and the other NFA states to the possible new unmarked DFA states.
11. The method of claim 10, wherein replacing comprises deleting the other NFA states and the epothilones terminations of the other NFA states.
12. The method of claim 3, further comprising: deleting the transitions of one or more states of the NFA graph from the DFA state table.
13. The method of claim 1, wherein mapping comprises:
determining a DFA start state; and
adding the DFA start state as an unmarked state to a data structure comprising the plurality of DFA states.
14. The method of claim 13, wherein determining a DFA start state comprises:
determining the Epsilon termination for the NFA onset state;
calculating a hash value of the Epsilon termination for the NFA Start State; and
mapping the DFA start state to the hash value of the Epsilon termination of the NFA start state.
15. The method of claim 13, wherein mapping comprises:
determining whether an unmarked DFA state exists within the plurality of DFA states;
selecting one of the unmarked DFA states;
flagging the unmarked DFA state; and
determining, for characters of letters identified by the NFA graph, transitions of one or more states of the NFA graph to other NFA states that map to flagged DFA states.
16. The method of claim 15, wherein determining a transition comprises:
determining a transition DFA state from the flagged DFA state for the character of the letter identified by the NFA graphic.
17. The method of claim 16, wherein determining the transition DFA state comprises:
computing a hash value of the transformation of the one or more states of the NFA graph; and
the computed hash value is compared to a hash value entry in an Epsilon terminating (EC) cache table that maps the hash value to DFA state.
18. The method of claim 17, further comprising:
determining whether the computed hash value matches one of the hash value entries in the EC cache table; and
if a match exists, setting the DFA state mapped to a matching hash value to a converted DFA state for the character of the letter identified by the NFA graph.
19. The method of claim 18, further comprising:
if there is no match, calculating an Epsilon termination for the transition of the one or more states of the NFA pattern;
calculating the Epsilon terminated hash value;
adding a new entry to the EC cache table, the new entry mapping a new DFA state to the hash value for the Epsilon termination; and
setting the new DFA state to the transition DFA state for the character of the letter identified by the NFA graph.
20. The method of claim 19, further comprising:
deleting the transitions of the one or more states of the NFA graph corresponding to the flagged DFA state from the DFA state table.
21. A method for determining epothilones termination for a set of NFA states, the method comprising:
in a processor of a security device coupled to a network:
receiving a set of NFA states, the set of NFA states being transition states for characters of a letter identified by an NFA graphic; and
determining whether the received NFA state set matches a set of NFA states in an Epsilon termination (EC) cache table that maps Epsilon terminations of the NFA state set with hash values of the NFA state set.
22. The method of claim 21, wherein determining further comprises:
calculating a hash value of the received NFA state set; and
matching the hash value with a hash value entry of the EC cache table.
23. The method of claim 22, further comprising:
setting the NFA state set mapped to the matching hash value in the EC cache table to the Epsilon termination of the received NFA state set if there is a match.
24. The method of claim 23, further comprising:
if there is no match, calculating an Epsilon termination for the received NFA state set;
adding a new entry to the EC cache table, the new entry mapping the hash value of the received NFA state set to the Epsilon termination of the received NFA state set.
25. The method of claim 24, wherein adding an entry comprises:
determining whether there is sufficient memory for adding the new entry;
if there is not enough memory, the new entry is added according to a replacement policy.
26. The method of claim 25 wherein the replacement policy adds the new entry by deleting a least recently used entry of the EC cache table.
27. A method for determining epothilones termination for a set of NFA states, the method comprising:
in a processor of a security device coupled to a network:
receiving a set of NFA states, the set of NFA states being transition states for characters of a letter identified by an NFA graphic; and
a determination is made whether the received NFA state set matches a set of Epsilon termination (EC) states in an EC cache table that maps Epsilon-terminated hash values associated with the NFA state set to DFA states.
28. The method of claim 27, wherein determining further comprises:
calculating a hash value of the received NFA state set; and
matching the hash value with a hash value entry of the EC cache table.
29. The method of claim 27, further comprising:
if a match exists, the DFA state associated with the matching hash value entry is set to a transition DFA state from the current DFA state for the input character of the letter identified by the NFA graph.
30. The method of claim 29, further comprising:
if a match does not exist, calculating an Epsilon termination for the received NFA state set;
calculating the Epsilon-terminated hash value of the received NFA state set;
adding a new entry to the EC cache table, the new entry mapping the hash values of the received NFA state set to new DFA states; and
setting the new DFA state to the transition DFA state from a current DFA state for the input character of the letter identified by the NFA graphic.
31. The method of claim 30, wherein adding an entry comprises:
determining whether there is sufficient memory for adding the new entry;
if there is not enough memory, a new entry is added according to the replacement policy.
32. The method of claim 31 wherein the replacement policy adds the new entry by deleting a least recently used entry of the EC cache table.
33. The method of claim 1, further comprising:
determining valid characters of a given pattern associated with letters identified by the NFA graph and the DFA graph; and
creating the NFA graph and the DFA graph to identify a pattern consisting only of valid characters of the given graph associated with the letters identified by the NFA graph and the DFA graph.
34. A security device coupled to a network, the security device comprising:
a processor configured to convert a non-deterministic finite automata (NFA) graph for a given set of patterns into a Deterministic Finite Automata (DFA) graph having a plurality of states; and
a compiler, the compiler configured to:
mapping each of a plurality of DFA states to one or more states of the NFA graph;
computing a hash value of the one or more states of the NFA graph mapped to each DFA state;
storing, in a data store, a DFA state table that correlates each of the plurality of DFA states to the hash value for the one or more states of the NFA graph for a given pattern.
35. The security device of claim 34, wherein the calculated hash value is an encrypted/perfect hash value.
36. The security device of claim 34, wherein the compiler is further configured to:
determining whether an unmarked DFA state exists within the plurality of DFA states;
selecting one of the unmarked DFA states;
flagging the unmarked DFA state; and
determining a transition of the one or more states of the NFA graph mapped to flagged DFA states to other NFA states of characters of letters identified by the NFA graph.
37. The security device of claim 36, wherein the compiler is further configured to: transitions are determined by mapping the other NFA states and the Epsilon terminations of the other NFA states to possible new unmarked DFA states.
38. The security device of claim 37, wherein the compiler is configured to map the possible new unmarked DFA state to the epothilones terminations of the other NFA states by: obtaining the EPON termination for the other NFA state from an EPON cache, and if the EPON termination for the other NFA state does not exist within the EPON cache, the compiler is configured to calculate the EPON termination for the other NFA state.
39. The security device of claim 38, wherein the compiler is further configured to add the possible new unmarked DFA state to the plurality of DFA states and store the possible new unmarked DFA state in the DFA state table if the other NFA states mapped to the possible new unmarked DFA state and the epothilony terminations of the other NFA states do not map to existing DFA states of the plurality of DFA states.
40. The security device of claim 39, wherein the compiler is configured to add the possible new unmarked DFA state by comparing the Epsilon terminated hash values of the other NFA states mapped to the possible new unmarked DFA state to hash values of the NFA states mapped to each of the plurality of DFA states.
41. The security device of claim 39, wherein the compiler is configured to add the possible unmarked DFA states to the plurality of DFA states by: determining whether the EPON terminations of the other NFA states that map to possible new unmarked states belong to a final accepted NFA state, and if so, the compiler is further configured to add a possible unmarked state as a final accepted DFA state.
42. The security device of claim 39, wherein the compiler is further configured to add a possible new unmarked state as a transition from a marked DFA state for the character of the letter identified by the NFA graph.
43. The security device of claim 37, wherein the compiler is further configured to replace the mapping of the epothilones terminations of the other NFA states and the other NFA states to the possible new unmarked DFA states with the calculated mapping of the epothilones terminations of the other NFA states and the other NFA states to the possible new unmarked DFA states.
44. The security device of claim 43, wherein the compiler is configured to delete the other NFA state and the Epsilon termination of the other NFA state.
45. The security device of claim 36, wherein the compiler is further configured to delete the transition of the one or more states of the NFA graph from the DFA state table.
46. The security device of claim 34, wherein the compiler is configured to map each of the plurality of DFA states to one or more states in the NFA graph by determining a DFA start state and adding the DFA start state as an unmarked state to a data structure comprising the plurality of DFA states.
47. The security device of claim 46, wherein the compiler is further configured to:
determining the Epsilon termination for the NFA onset state;
calculating a hash value of the Epsilon termination for the NFA Start State; and
mapping the DFA start state to the hash value of the Epsilon termination of the NFA start state.
48. The security device of claim 46, wherein the compiler is further configured to:
determining whether an unmarked DFA state exists within the plurality of DFA states;
selecting one of the unmarked DFA states;
flagging the unmarked DFA state; and
determining a transition of the one or more states of the NFA graph that map to the marked DFA states to other NFA states for characters of letters identified by the NFA graph.
49. The security device of claim 48, wherein the compiler is configured to determine a transition by determining a transition DFA state from the flagged DFA state for the character of the letter identified by the NFA graphic.
50. The security device of claim 49, wherein the compiler is further configured to:
computing a hash value of the transformation of the one or more states of the NFA graph; and
the computed hash value is compared to a hash value entry in an Epsilon terminating (EC) cache table that maps the hash value to DFA state.
51. The security device of claim 50, wherein the compiler is further configured to:
determining whether the computed hash value matches one of the hash value entries in the EC cache table; and
if a match exists, setting the DFA state mapped to the matched hash value to a converted DFA state for the character of the letter identified by the NFA graph.
52. The security device of claim 51, wherein the compiler is further configured to:
if there is no match, calculating an Epsilon termination for the transition of the one or more states of the NFA pattern;
calculating the Epsilon terminated hash value;
adding a new entry to the EC cache table, the new entry mapping a new DFA state to the hash value for the Epsilon termination; and
setting the new DFA state to the transition DFA state for the character of the letter identified by the NFA graph.
53. The security device of claim 52, wherein the compiler is further configured to:
deleting the transitions of the one or more states of the NFA graph corresponding to the flagged DFA state from a DFA states table.
54. A security device for determining an epothilony termination of an NFA state set, the security device comprising:
a compiler, the compiler configured to:
receiving a set of NFA states, the set of NFA states being transition states of characters of letters identified by an NFA graphic; and
determining whether the received NFA state set matches a set of NFA states in an Epsilon termination (EC) cache table that maps Epsilon terminations of the NFA state set with hash values of the NFA state set.
55. The security device of claim 54, wherein the compiler is further configured to:
calculating a hash value of the received NFA state set; and
matching the hash value with a hash value entry of the EC cache table.
56. The security device of claim 55, wherein the compiler is further configured to:
setting the NFA state set mapped to the matching hash value in the EC cache table to the Epsilon termination of the received NFA state set if there is a match.
57. The security device of claim 56, wherein the compiler is further configured to:
if there is no match, calculating an Epsilon termination for the received NFA state set;
adding a new entry to the EC cache table, the new entry mapping the hash value of the received NFA state set to the Epsilon termination of the received NFA state set.
58. The security device of claim 57, wherein the compiler is further configured to:
determining whether there is sufficient memory for adding the new entry;
if there is not enough memory, the new entry is added according to a replacement policy.
59. The security device of claim 58 wherein the replacement policy adds the new entry by deleting a least recently used entry of the EC cache table.
60. A security device for determining an epothilony termination of an NFA state set, the security device comprising:
a compiler, the compiler configured to:
receiving a set of NFA states, the set of NFA states being transition states of characters of letters identified by an NFA graphic; and
a determination is made whether the received NFA state set matches a set of Epsilon termination (EC) states in an EC cache table that maps Epsilon-terminated hash values associated with the NFA state set to DFA states.
61. The security device of claim 60, wherein the compiler is further configured to:
calculating a hash value of the received NFA state set; and
matching the hash value with a hash value entry of the EC cache table.
62. The security device of claim 61, wherein the compiler is further configured to:
if a match exists, the DFA state associated with the matching hash value entry is set to a transition DFA state from the current DFA state for the input character of the letter identified by the NFA graph.
63. The security device of claim 62, wherein the compiler is further configured to:
if a match does not exist, calculating an Epsilon termination for the received NFA state set;
calculating the Epsilon-terminated hash value of the received NFA state set;
adding a new entry to the EC cache table, the new entry mapping the hash values of the received NFA state set to new DFA states; and
setting the new DFA state to a transition DFA state from a current DFA state for the input character of the letter identified by the NFA graph.
64. The security device of claim 63, wherein the compiler is further configured to:
determining whether there is sufficient memory for adding the new entry;
if there is not enough memory, the new entry is added according to a replacement policy.
65. The security device of claim 64 wherein the replacement policy adds the new entry by deleting a least recently used entry of the EC cache table.
66. The security device of claim 34, wherein the compiler is further configured to:
determining valid characters of a given pattern associated with letters identified by the NFA graph and DFA graph; and
creating the NFA graph and DFA graph to identify a pattern consisting only of valid characters of a given pattern associated with the letters identified by the NFA graph and DFA graph.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/168,450 | 2011-06-24 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1193278A true HK1193278A (en) | 2014-09-12 |
| HK1193278B HK1193278B (en) | 2018-03-09 |
Family
ID=
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107122221B (en) | compiler for regular expressions | |
| US9514246B2 (en) | Anchored patterns | |
| JP4598127B2 (en) | Stateful packet content matching mechanism | |
| Becchi et al. | Memory-efficient regular expression search using state merging | |
| US10367786B2 (en) | Configuration management for a capture/registration system | |
| US9426166B2 (en) | Method and apparatus for processing finite automata | |
| US8706709B2 (en) | System and method for intelligent term grouping | |
| US9015102B2 (en) | Match engine for detection of multi-pattern rules | |
| US20150067776A1 (en) | Method and apparatus for compilation of finite automata | |
| US11093534B2 (en) | System and method for keyword searching using both static and dynamic dictionaries | |
| US10291632B2 (en) | Filtering of metadata signatures | |
| US8272056B2 (en) | Efficient intrusion detection | |
| Weng et al. | Deep packet pre-filtering and finite state encoding for adaptive intrusion detection system | |
| Sharma et al. | Optituned: An optimized framework for zero-day dns tunnel detection using n-grams | |
| HK1193278A (en) | Compiler for regular expressions | |
| HK1193278B (en) | Compiler for regular expressions | |
| Yu et al. | Fast packet pattern-matching algorithms | |
| Li | M-ISDS: A Mobilized Intrusion and Spam Detection System | |
| Xue et al. | A Novel Deep Packet Inspection Method for Polymorphic Network | |
| Yao et al. | Anomaly Detection on Network Traffic | |
| Hashem et al. | HES: Highly Efficient and Scalable Technique for Matching Regex Patterns | |
| CN120110738A (en) | Packet filtering method and device | |
| Manjunath et al. | Fast Pattern Matching Approach for Intrusion Detection Systems | |
| Korpraserttaworn et al. | Piranha+ Cuckoo Snort NIDS | |
| HK1208105B (en) | Method and apparatus for compilation of finite automata |