CN111400563B - Pattern matching method and device for pattern matching - Google Patents
Pattern matching method and device for pattern matching Download PDFInfo
- Publication number
- CN111400563B CN111400563B CN202010183402.5A CN202010183402A CN111400563B CN 111400563 B CN111400563 B CN 111400563B CN 202010183402 A CN202010183402 A CN 202010183402A CN 111400563 B CN111400563 B CN 111400563B
- Authority
- CN
- China
- Prior art keywords
- string
- pattern
- matching
- vector table
- bit vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention provides a pattern matching method, a pattern matching device and a pattern matching device. The method specifically comprises the following steps: performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings; encoding each word in the first word segmentation set to obtain an encoded pattern string; distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table. The embodiment of the invention can improve the utilization rate of the B table and the efficiency of pattern matching.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a pattern matching method and apparatus, and a device for pattern matching.
Background
Character string matching, also known as pattern matching, is a key technology widely applied to the fields of information retrieval, intrusion detection, computing biology, search engines, data compression and the like. By pattern matching, it is meant that all occurrence positions of a certain pattern string p=p1p2..pm in the text string t=t1t2..tn are found.
The bit-parallel algorithm is a pattern matching algorithm which is currently more commonly used, and comprises shift-and (shift-and), shift-or (shift-or), BNDM (Backward Nondeterministic Dawg Matching, suffix automaton matching). Typically, bit-parallel algorithms maintain a table of bit vectors, abbreviated as B-table, in a computer cache. The B table can be understood as an n x m 0/1 matrix, where 0/1 is used to indicate whether the corresponding character appears in the pattern string.
However, each bit in the B table represents a character, resulting in the B table taking up a lot of memory space. Furthermore, in practical applications, not every matched result is meaningful due to the specificity of semantics. For example, the text string is "Liu Dehua is as beautiful as" and the pattern string is "Hua Ye" is successfully matched, but the matching is not meaningful because the text string has no semantic association with "Hua Ye", so that meaningless matching results exist.
Disclosure of Invention
The embodiment of the invention provides a pattern matching method, a pattern matching device and a pattern matching device, which can improve the utilization rate of a B table and the pattern matching efficiency.
In order to solve the above problems, an embodiment of the present invention discloses a pattern matching method, which includes:
Performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings;
encoding each word in the first word segmentation set to obtain an encoded pattern string;
distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
In another aspect, an embodiment of the present invention discloses a pattern matching apparatus, including:
the first word segmentation module is used for segmenting the mode strings to be matched to obtain a first word segmentation set corresponding to the mode strings;
the first coding module is used for coding each word in the first word segmentation set to obtain a coded mode string;
the allocation module is used for allocating a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
and the pattern matching module is used for matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
In yet another aspect, an embodiment of the present invention discloses an apparatus for pattern matching, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings;
encoding each word in the first word segmentation set to obtain an encoded pattern string;
distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
In yet another aspect, embodiments of the present invention disclose a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a pattern matching method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
before pattern matching, the embodiment of the invention performs word segmentation on the pattern string to obtain a first word segmentation sequence corresponding to the pattern string, and encodes each word segmentation in the first word segmentation sequence to obtain an encoded pattern string; and allocating a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for representing whether corresponding segmentation occurs in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
Firstly, the embodiment of the invention carries out word segmentation on the mode string and then codes, so that semantic errors can be corrected, meaningless matching results can be filtered, the value range length of the B table can be compressed by the code after word segmentation, the occupied space of the B table is reduced, and the utilization rate of the B table is improved.
In addition, the embodiment of the invention can reduce the probability of generating collision mode strings when hitting the target substring by selecting the target substring through the word frequency of the substring, thereby improving the subsequent query efficiency.
Furthermore, under the condition that the target substring has the collision mode string, the embodiment of the invention determines the query sequence of the collision mode string based on the word frequency of the mode string so as to reduce the query times as much as possible and improve the query efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of an embodiment of a pattern matching method of the present invention;
FIG. 2 is a block diagram of an embodiment of a pattern matching device of the present invention;
FIG. 3 is a block diagram of an apparatus 800 for pattern matching of the present invention; and
Fig. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a pattern matching method according to the present invention may specifically include the following steps:
step 101, word segmentation is carried out on a mode string to be matched, and a first word segmentation set corresponding to the mode string is obtained;
102, encoding each word in the first word segmentation set to obtain an encoded mode string;
step 103, allocating a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
And 104, matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
The pattern matching method of the embodiment of the invention is applicable to electronic equipment, and the electronic equipment comprises but is not limited to: servers, smartphones, recording pens, tablet computers, electronic book readers, MP3 (dynamic video expert compression standard audio plane 3,Moving Picture Experts Group Audio Layer III) players, MP4 (dynamic video expert compression standard audio plane 4,Moving Picture Experts Group Audio Layer IV) players, laptop computers, car computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
The pattern matching method of the embodiment of the invention can be used for single-mode matching or multi-mode matching, wherein the single-mode matching refers to matching a pattern string in a text string; multimodal matching refers to matching multiple strings of patterns in a text string.
Before matching a text string and a pattern string, the embodiment of the invention performs word segmentation processing on the pattern string to obtain a first word segmentation set corresponding to the pattern string.
In a chinese example of an embodiment of the present invention, assume that the text string to be matched is "Liu Dehua is so beautiful", and the pattern string to be matched includes: "why you and how you are" two pattern strings. Firstly, the two pattern strings to be matched are respectively segmented, and a first segmentation set { "why", "then", "commander", "Hua Cheng" } can be obtained according to the segmentation result. Then, each word in the first word segmentation set is encoded, and an encoded mode string is obtained. For example, "why" is encoded as C, "then" is encoded as D, "commander" is encoded as E, "hua is encoded as F, and then the two encoded pattern strings are respectively: CDE and F.
In an alternative embodiment of the present invention, before the matching between the pattern string and the text string to be matched based on the bit vector table in step 104, the method may further include:
s11, word segmentation is carried out on the text strings, and a second word segmentation set corresponding to the text strings is obtained;
and step S12, encoding each word in the second word segmentation set to obtain an encoded text string.
After the pattern matching system reads the text string, the text string can be segmented to obtain a second segmentation set corresponding to the text string. For example, in the above example, the text string "Liu Dehua" why it is then "may be segmented to obtain a second set of segments {" Liu Dehua "," why it is then "," general "}, and then each segment in the second set of segments is encoded to obtain an encoded text string. For example, when "why" is encoded as C, "then" is encoded as D, "commander" is encoded as E, "Liu Dehua" is encoded as a, then the encoded text string is: ACDE. The first word segmentation set and the second word segmentation set can adopt the same coding rule in the coding process, so that the same word segmentation can correspond to the same coding.
Typically, the bit-parallel algorithm maintains a bit vector table, abbreviated as B table, in the computer cache, where the length of the B table does not exceed w, which is the bit length (a machine word length, e.g., 64 bits) that the computer processes at a time. The B table may be an n x m 0/1 matrix, where 0/1 in the table is used to indicate whether the character at the corresponding position appears in the pattern string, and n is the value range length of the pattern string.
Taking a 64-bit vector table as an example, in the case of single-mode matching (the number of pattern strings to be matched n=1), the pattern strings to be matched may be directly allocated to the 64-bit vector table. In the case of multimode matching (the number n of pattern strings to be matched is greater than 1), the bit vector table needs to be used in groups, and the sum of the bit lengths of all the groups does not exceed w (64 bits). For a bit vector table with the number of the packets being m, n coded pattern strings are firstly allocated into the bit vector table containing m packets, and then, in a plurality of packets of one machine word, bit operations of a shift-and algorithm are simultaneously performed on a plurality of pattern strings. The n coded pattern strings are distributed into a bit vector table containing m packets, and there may be two cases in which n is less than or equal to m, each pattern string may be individually distributed with one packet; in the case where n is greater than m, multiple pattern strings may be allocated to the same packet. For convenience of description, in the embodiment of the present invention, the case where n is less than or equal to m is taken as an example, and in the case where n is greater than m, it is only necessary to compare multiple pattern strings in one packet to determine which pattern string is hit.
In step 102 of the embodiment of the present invention, after encoding each word in the first word segmentation set to obtain an encoded pattern string, a bit vector table (B table) may be allocated to the encoded pattern string, so that bits in the B table corresponding to the encoded pattern string may be used to indicate whether the corresponding word segment appears in the pattern string to be matched.
Step 104, according to the bit vector table, matches the pattern string with the text string to be matched based on a bit parallel algorithm, which may specifically include: and matching the coded pattern string with the coded text string based on a bit parallel algorithm according to the bit vector table.
For example, in the above example, the question of why the matching pattern string "why then" and "how" among the text string "Liu Dehua then" is converted into: the problem of matching the encoded pattern strings "CDE" and "F" in the encoded text string "ACDE". It can be seen that "F" does not exist in "ACDE", so the resulting matching string includes "why it is so beautiful", but does not include "hua-si", and the embodiment of the present invention encodes based on the segmented pattern string, and can filter meaningless matching results.
In addition, the embodiment of the invention distributes the B table based on the segmented mode string, so that the value range length of the B table can be reduced, and the utilization rate of the B table can be improved. For example, in the above example, for the pattern strings "why so" and "how" B tables are assigned in a conventional manner, each bit in the B tables represents a character, and the value range length of the B tables is typically an exponential order of 2. However, in the embodiment of the invention, the words "why" and "how" are the mode string are segmented, the word "why" is coded as C, "what" is coded as D, "how" is coded as E, "how" is coded as F, and the words which do not appear in the mode string are uniformly coded as G. Therefore, the value range length of the B table is only 5, the value range length of the B table is greatly reduced, and the utilization rate of the B table can be further improved. Referring to Table 1, an example of the "why so" allocation B table for a string of modes is shown in accordance with embodiments of the present invention.
TABLE 1
As shown in table 1, the pattern string "why it is then the general" is segmented and encoded to obtain the encoded pattern string "CDE", where "C" is the code corresponding to the segment "why" and the first behavior code "C" corresponds to the bitmask "001" as shown in table 1. It should be noted that, in the embodiment of the present invention, the order of character representation in the pattern string is opposite to the order of character representation in the bit mask. For example, in "001", 1 st bit from right to left is 1, and the remaining bits are 0, which means that in the encoded pattern string "CDE", 1 st bit from left to right is "C". Similarly, as shown in table 1, the second behavior encodes a bit mask "010" corresponding to "D". In "010", the 2 nd bit from right to left is 1, and the remaining bits are 0, indicating that in the encoded pattern string "CDE", the 2 nd bit from left to right is "D". The third behavior encodes a bitmask "100" corresponding to "E". In "100", the 3 rd bit from right to left is 1, and the remaining bits are 0, indicating that in the encoded pattern string "CDE", the 3 rd bit from left to right is "E". The fourth line '000' bit mask indicates the characters that do not appear in the encoded pattern string 'CDE', i.e., indicates the other words that do not appear in the pattern string 'why it is so beautiful'.
It should be noted that, the foregoing coding mode of "why" is coded as "C", "then" is coded as "D", "commander" is coded as "E", "commander" is coded as "F", and other words are uniformly coded as "G" is only an application example of the embodiment of the present invention, and the embodiment of the present invention is not limited to a specific coding mode.
For example, in the case of multimode matching, for n pattern strings to be matched, the n pattern strings are first segmented to obtain a segmented set of segmented words. And then encoding each word in the word segmentation set, wherein each word corresponds to one encoding. Assuming that there are 100 words in the word segmentation set obtained by segmenting the n pattern strings, the 100 words are coded from the code 0, the codes corresponding to the 100 words are from 0 to 99, each code corresponds to one word, for example, the code corresponding to the first word is 0, the code corresponding to the second word is 1, and so on, the code corresponding to the 100 th word is 99, and the codes of the non-occurring words in the 100 words are 100. And then, coding the n mode strings to be matched according to the codes corresponding to the segmentation words, so as to obtain n coded mode strings. Finally, a B table is allocated to the encoded n pattern strings.
The embodiment of the invention carries out word segmentation on the mode string and then codes, so that semantic errors can be corrected, meaningless matching results can be filtered, the value range length of the B table can be compressed by the code after word segmentation, the occupied space of the B table is reduced, and the utilization rate of the B table is improved.
In an optional embodiment of the present invention, before step 103 allocates the bit vector table corresponding to the encoded pattern string, the method may further include:
step S21, dividing the coded pattern string according to the grouping length when the length of the coded pattern string is larger than the preset grouping length, so as to obtain each sub-string after division;
step S22, determining a target substring in each substring;
based on this, step 103 allocates a bit vector table corresponding to the encoded pattern string, which may specifically include: and distributing a bit vector table corresponding to the target substring.
In a specific application, the multimode matching includes a plurality of pattern strings to be matched, so that the B table is generally required to be used in a grouping manner, and the sum of bit lengths of all the groupings does not exceed w (the bit length processed by a computer at a time, such as a machine word length). In the bit parallel algorithm, when the length h of one pattern string is greater than the packet length M of the B table, the pattern string needs to be truncated. The packet length M may be preset according to actual needs. It is common practice to intercept a substring of the first M characters or a substring of the last M characters of the pattern string instead of the pattern string.
In an english example of the embodiment of the invention, for the text string "ababfdelabc", it is assumed that the mode string to be matched includes "abcde" and "abfde" as two mode strings, and the mode strings "abcde" and "abcde" are allocated in the same packet of the table B, and the length M of the packet is assumed to be 2 bits, and the lengths of the mode strings "abcde" and "abfde" are both greater than the packet length, so that it is necessary to truncate the mode strings "abcde" and "abfde" respectively, and if truncating is performed by truncating the first 2 characters, then both mode strings are truncated substrings "ab", that is, the target substring "ab" is used to replace the mode string "abcde", and the target substring "ab" is used to replace the mode string "abfde". At this time, only the B table corresponding to the target substring "ab" needs to be allocated.
Optionally, step 104 matches the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table, which specifically may include:
step S31, matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring, and obtaining the position of the matching string in the text string;
And step S32, inquiring whether the text string hits the mode string corresponding to the target substring according to the position of the matching string in the text string.
In the above example, the same target substring "ab" is used for both the pattern strings s1= "abcde" and s2= "abfde", and the allocated B table is shown in table 2.
TABLE 2
| a | 01 |
| b | 10 |
| c | 00 |
According to the table B shown in table 2, the target substring "ab" and the text string "ababfdebac" are matched based on the bit-parallel algorithm, so that 3 matching strings "ab" can be obtained, and the positions of the three matching strings in the text string "ababfdebac" are pos=0, pos=2, and pos=7, respectively.
In the embodiment of the present invention, assuming that a target sub-string corresponds to k pattern strings, k is an integer greater than 1, in the process of matching text strings to be matched by using the target sub-string, if the text string hits the target sub-string, that is, if the target sub-string exists in the text string, collision of the k pattern strings is generated.
For example, in the above example, the target substring "ab" corresponds to two pattern strings (k=2), respectively, the pattern string "abcde" and the pattern string "abfde". In the process of matching the target substring "ab" with the text string "ababfdebac", it can be found that the text string hits the target substring "ab" when the second character "b" of the text string is read in. Since the target substring "ab" corresponds to the pattern strings "abcde" and "abfde" at the same time, the pattern strings "abcde" and "abfde" collide, which are called "abcde" and "abfde" as collision pattern strings. At this time, it is impossible to determine which pattern string "abfcde" is identical to the text string "ababfdeac" among the pattern strings "abcde" and "abfde".
It can be seen that, in the case of using the target sub-string instead of the pattern string, since the target sub-string represents only part of the content of the pattern string, the text string hits the target sub-string, only represents that the text string has a possibility of hitting the pattern string corresponding to the target sub-string, all the collision pattern strings generated by the target sub-string need to be traversed, and whether the text string hits each collision pattern string is further queried according to the position of the matching string in the text string.
For example, the target substring "ab" generates collision pattern strings "abcde" and "abfde", assuming that the query order is to query the pattern string "abcde" first and then query the pattern string "abfde". First, a string "ababf" of length 5 from the position pos=0 in the text string "ababffdebc" is compared with the pattern string "abcde" to determine whether the string "ababf" and the pattern string "abcde" match. Then, the character string "abfdecabc" of length 5 from the position pos=2 is compared with the pattern string "abcde" to see whether or not the character string "abfdecabc" is matched, and the result is a mismatch. Finally, it is compared whether the character string of length 5 from position pos=7 in the text string "ababffdebc" matches the pattern string "abcde", and the character string from position pos=7 to the last character of the character string is only 3 in length, and thus does not match.
After the completion of the query for the pattern string "abcde", it is found that the pattern string "abcde" does not exist in the text string "ababfdebac", that is, the text string "ababfdebac" misses the pattern string "abcde", and at this time, whether the text string "ababfdebac" hits the pattern string "abfde" is queried again according to the above-described query procedure. The query result is a text string "ababffdebc" hit pattern string "abfde", and the position of the pattern string "abfde" in the text string "ababffdebc" is pos=2.
It can be seen that in the above example, the first two characters "ab" of the pattern strings "abcde" and "abfde" are truncated as target substrings instead of the pattern strings, resulting in collision pattern strings "abcde" and "abfde" being generated in the matching process. In practical applications, the greater the number of pattern strings, the greater the number of generated collision pattern strings, and then the further query comparison is required for each collision pattern string to determine whether the text string actually hits the collision pattern string, resulting in lower matching efficiency.
In an optional embodiment of the present invention, step S32 may specifically include:
Step S41, if the target substring has collision mode strings, respectively calculating word frequency of each collision mode string;
step S42, according to the position of the matching string in the text string, sequentially inquiring whether the text string hits the collision mode string according to the sequence of the word frequency of the collision mode string from large to small.
In the above example, the pattern string includes "abcde" and "abfde", and if the first two characters "ab" of the pattern strings "abcde" and "abfde" are truncated as the target substring instead of the pattern string, the target substring "ab" may have collision pattern strings "abcde" and "abfde" in the process of matching with the text string "ababfdeac".
As can be seen from the above examples, in the case where there is a collision pattern string in the target sub-string, the order in which the collision pattern strings are queried in the text string will affect the efficiency of the query. In order to improve query efficiency, the embodiment of the invention calculates the word frequency of each mode string in the collision mode strings.
Specifically, the word frequency of each pattern string in the collision pattern string may be calculated based on a given corpus, for example, the word frequencies of the pattern strings "abcde" and "abfde" are calculated, and if F (abcde) =10 and F (abfde) =20, the pattern strings "abcde" and "abfde" are hung into a linked list or stored in sequence from large to small according to the word frequency, and are queried according to the word frequency sequence, that is, the pattern string "abcde" is queried first and then the pattern string "abcde" is queried.
Because the word frequency of the collision mode string can be obtained according to corpus statistics, the probability of occurrence of the mode string 'abfde' is higher than that of the mode string 'abcde' under normal conditions, therefore, the mode string with higher word frequency is searched first, the hit probability is higher, the frequency of subsequent inquiry can be reduced, and the efficiency of mode matching is improved.
In an alternative embodiment of the present invention, step S22 may specifically include, among the sub-strings, determining a target sub-string: and determining the substring with the minimum word frequency as a target substring in the substrings.
In order to reduce the probability of generating collision mode strings by target substrings as much as possible, the embodiment of the invention determines the target substrings according to word frequency. Specifically, when the length h of the pattern string exceeds the group length M of the B table, the pattern string is sequentially divided according to the group length M of the B table to obtain each sub-string of the pattern string, word frequencies of the sub-strings obtained after division are calculated respectively, and the sub-string with the smallest word frequency is selected as the target sub-string.
For example, in the above example, for the pattern string "abcde", the length h=5, and the group length m=2 of the B table packet allocated by the pattern string is assumed. Dividing the pattern string according to the group length M of the B table to obtain the substrings of the pattern string as follows: "ab", "bc", "cd", "de", "bf", "fd", "de". The word frequency F of each substring is calculated, assuming that F (ab) = 2,F (bc) =1, F (cd) =1, F (de) = 2,F (bf) =1, F (fd) =1, and for the pattern string "abcde", the word frequency of the substring "bc" and the substring "cd" is the smallest, and therefore the substring "bc" or "cd" can be regarded as the target substring of the pattern string "abcde". Similarly, for the pattern string "abfde", the sub-string "cd" and the sub-string "bf" have the smallest word frequency, and therefore the sub-string "cd" or the sub-string "bf" can be regarded as the target sub-string of the pattern string "abfde".
In one example, let B be table as shown in table 3, assuming that the sub-string "bc" is the target sub-string of the pattern string "abcde" and the sub-string "bf" is the target sub-string of the pattern string "abfde".
TABLE 3 Table 3
| a | 00 |
| b | 01 |
| c | 10 |
| d | 00 |
| e | 00 |
| f | 10 |
| g | 00 |
In the process of matching text strings according to table 3, when a text string hits a target substring "bc" or hits a target substring "bf", a pattern string "abcde" and a pattern string "abfde" do not collide. It can be seen that, according to the embodiment of the invention, the target substring is selected through the word frequency of the substring, so that the probability of generating the collision mode string when the target substring is hit can be reduced, and the subsequent query efficiency can be improved.
It will be appreciated that embodiments of the present invention are not limited to a particular manner of calculating the sub-string word frequency. For example, the calculation may be performed based on a document in the application environment where the current text string is located, or may be performed based on big data, or the like.
In an optional embodiment of the present invention, in step 103, if the number n of the pattern strings to be matched is greater than 1, the allocating a bit vector table corresponding to the encoded pattern string may specifically include:
step S51, determining the grouping number m of the bit vector table;
Step S52, n coded pattern strings are distributed into a bit vector table containing m groups.
In the case of multimode matching (the number n of pattern strings to be matched is greater than 1), the bit vector table needs to be used in groups, and the sum of bit lengths of all the groups does not exceed w (64 bits). For a bit vector table with the number of the packets being m, n coded pattern strings are firstly allocated into the bit vector table containing m packets, and then, in a plurality of packets of one machine word, bit operations of a shift-and algorithm are simultaneously performed on a plurality of pattern strings. The n coded pattern strings are distributed into a bit vector table containing m packets, and there may be two cases in which n is less than or equal to m, each pattern string may be individually distributed with one packet; in the case where n is greater than m, multiple pattern strings may be allocated to the same packet.
For example, 5 pattern strings are allocated in one packet, and when a certain character of a text string is read, the character may hit on a plurality of the 5 pattern strings at the same time, so that a collision of the pattern strings is generated, which pattern string is hit is required to be further compared, the comparison operation cost is increased, and the matching efficiency is affected.
In an alternative embodiment of the present invention, the allocating the n encoded pattern strings into the bit vector table containing m packets may specifically include:
step S61, determining a misjudgment character string set generated by distributing the ith coded pattern string to the jth grouping; wherein, the value of i is 1-n, and the value of j is 1-m;
step S62, determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and step S63, distributing the ith coded pattern string to the packet with the minimum first loss gain until the nth coded pattern string is distributed.
In order to reduce collision of pattern strings as much as possible and improve pattern matching efficiency, in the multimode matching process, n pattern strings are distributed into m packets according to a minimum loss gain based on a greedy principle under the condition of given packet number and packet length.
In summary, before pattern matching, the embodiment of the invention performs word segmentation on the pattern string to obtain a first word segmentation set corresponding to the pattern string, and encodes each word segmentation in the first word segmentation set to obtain an encoded pattern string; and allocating a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for representing whether corresponding segmentation occurs in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
Firstly, the embodiment of the invention carries out word segmentation on the mode string and then codes, so that semantic errors can be corrected, meaningless matching results can be filtered, the value range length of the B table can be compressed by the code after word segmentation, the occupied space of the B table is reduced, and the utilization rate of the B table is improved.
In addition, the embodiment of the invention can reduce the probability of generating collision mode strings when hitting the target substring by selecting the target substring through the word frequency of the substring, thereby improving the subsequent query efficiency.
Furthermore, under the condition that the target substring has the collision mode string, the embodiment of the invention determines the query sequence of the collision mode string based on the word frequency of the mode string so as to reduce the query times as much as possible and improve the query efficiency.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Device embodiment
Referring to fig. 2, there is shown a block diagram of an embodiment of a pattern matching device of the present invention, which may specifically include:
the first word segmentation module 201 is configured to segment a pattern string to be matched to obtain a first word segmentation set corresponding to the pattern string;
a first encoding module 202, configured to encode each word in the first word segmentation set to obtain an encoded pattern string;
an allocation module 203, configured to allocate a bit vector table corresponding to the encoded pattern string, where bits in the bit vector table are used to indicate whether a corresponding word occurs in the pattern string;
and the pattern matching module 204 is configured to match the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
Optionally, the apparatus further comprises:
the second word segmentation module is used for segmenting the text string to obtain a second word segmentation set corresponding to the text string;
the second coding module is used for coding each word in the second word segmentation set to obtain a coded text string;
the pattern matching module is specifically configured to match the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
Optionally, the apparatus further comprises:
the sub-string dividing module is used for dividing the coded mode string according to the grouping length when the length of the coded mode string is larger than the preset grouping length, so as to obtain each divided sub-string;
the target determining module is used for determining target substrings in the substrings;
the allocation module is specifically configured to allocate a bit vector table corresponding to the target substring;
the pattern matching module comprises:
the matching sub-module is used for matching the target sub-string with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target sub-string to obtain the position of the matching string in the text string;
and the inquiring sub-module is used for inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
Optionally, the target determining module is specifically configured to determine, among the sub-strings, a sub-string with the smallest word frequency as a target sub-string.
Optionally, the query sub-module includes:
the word frequency calculation unit is used for calculating the word frequency of each collision mode string if the collision mode string exists in the target sub-string;
And the query comparison unit is used for sequentially querying whether the text string hits the collision mode string according to the position of the matching string in the text string and the sequence of the word frequency of the collision mode string from large to small.
Optionally, if the number n of the pattern strings to be matched is greater than 1, the allocation module includes:
a grouping determination submodule, configured to determine the number m of groupings of the bit vector table;
and the mode string allocation submodule is used for allocating the n coded mode strings to a bit vector table containing m groups.
Optionally, the grouping assignment submodule includes:
a first calculation unit, configured to determine a misjudgment string set generated by allocating the ith coded pattern string to the jth packet; wherein, the value of i is 1-n, and the value of j is 1-m;
the second calculation unit is used for determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and the mode string allocation unit is used for allocating the ith coded mode string to the packet with the minimum first loss gain until the nth coded mode string allocation is completed.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
An embodiment of the invention provides a device for pattern matching, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and configured to be executed by one or more processors, the one or more programs comprising instructions for: performing word segmentation on the mode string to obtain a first word segmentation set corresponding to the mode string; encoding each word in the first word segmentation set to obtain an encoded pattern string; distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
Fig. 3 is a block diagram illustrating an apparatus 800 for pattern matching according to an example embodiment. For example, apparatus 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 3, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen between the device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800, a relative positioning of the components, such as a display and keypad of the apparatus 800, the sensor assembly 814 may also detect a change in position of the apparatus 800 or one component of the apparatus 800, the presence or absence of user contact with the apparatus 800, an orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices, either in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on radio frequency information processing (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of apparatus 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (central processing units, CPU) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Wherein the memory 1932 and storage medium 1930 may be transitory or persistent. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, a central processor 1922 may be provided in communication with a storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, mac OS XTM, unixTM, linuxTM, freeBSDTM, and the like.
A non-transitory computer readable storage medium, which when executed by a processor of an apparatus (server or terminal), enables the apparatus to perform the pattern matching method shown in fig. 1.
A non-transitory computer readable storage medium, which when executed by a processor of an apparatus (server or terminal), causes the apparatus to perform a pattern matching method, the method comprising: performing word segmentation on the mode string to obtain a first word segmentation set corresponding to the mode string; encoding each word in the first word segmentation set to obtain an encoded pattern string; distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string; and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
The embodiment of the invention discloses A1 and a pattern matching method, which comprises the following steps:
performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings;
encoding each word in the first word segmentation set to obtain an encoded pattern string;
Distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
A2, before the pattern string is matched with the text string to be matched based on a bit parallel algorithm according to the bit vector table and the method of A1, the method further comprises:
word segmentation is carried out on the text strings, and a second word segmentation set corresponding to the text strings is obtained;
encoding each word in the second word segmentation set to obtain an encoded text string;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
and matching the coded pattern string with the coded text string based on a bit parallel algorithm according to the bit vector table.
A3, before the bit vector table corresponding to the encoded pattern string is allocated according to the method of A1, the method further comprises:
dividing the coded mode string according to the grouping length when the length of the coded mode string is larger than the preset grouping length, so as to obtain each sub-string after division;
Determining a target substring in the substrings;
the bit vector table corresponding to the coded pattern string is allocated, which comprises the following steps:
distributing a bit vector table corresponding to the target substring;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring, so as to obtain the position of the matching string in the text string;
and inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
A4, determining a target substring in the substrings according to the method of A3, wherein the method comprises the following steps:
and determining the substring with the minimum word frequency as a target substring in the substrings.
A5, according to the method of A3, the searching whether the text string hits the pattern string corresponding to the target sub-string according to the position of the matching string in the text string comprises:
if the target substring has a collision mode string, respectively calculating word frequency of each collision mode string;
and according to the position of the matching string in the text string, sequentially inquiring whether the text string hits the collision mode string according to the sequence of the word frequency of the collision mode string from large to small.
A6, according to the method of A1, if the number n of the pattern strings to be matched is greater than 1, the allocating the bit vector table corresponding to the encoded pattern strings includes:
determining the grouping number m of the bit vector table;
the n encoded pattern strings are assigned to a bit vector table containing m packets.
A7, the method according to A6, wherein the allocating the n encoded pattern strings into the bit vector table containing m packets comprises:
determining a misjudgment character string set generated by distributing the ith coded pattern string to the jth grouping; wherein, the value of i is 1-n, and the value of j is 1-m;
determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and distributing the ith coded pattern string to the packet with the minimum loss gain until the nth coded pattern string is distributed.
The embodiment of the invention discloses a B8 mode matching device, which comprises:
the first word segmentation module is used for segmenting the mode strings to be matched to obtain a first word segmentation set corresponding to the mode strings;
The first coding module is used for coding each word in the first word segmentation set to obtain a coded mode string;
the allocation module is used for allocating a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
and the pattern matching module is used for matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
B9, the apparatus of B8, the apparatus further comprising:
the second word segmentation module is used for segmenting the text string to obtain a second word segmentation set corresponding to the text string;
the second coding module is used for coding each word in the second word segmentation set to obtain a coded text string;
the pattern matching module is specifically configured to match the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
B10, the apparatus of B8, the apparatus further comprising:
the sub-string dividing module is used for dividing the coded mode string according to the grouping length when the length of the coded mode string is larger than the preset grouping length, so as to obtain each divided sub-string;
The target determining module is used for determining target substrings in the substrings;
the allocation module is specifically configured to allocate a bit vector table corresponding to the target substring;
the pattern matching module comprises:
the matching sub-module is used for matching the target sub-string with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target sub-string to obtain the position of the matching string in the text string;
and the inquiring sub-module is used for inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
B11, the device according to B10, the said goal determining module, is used for confirming the minimum sub-string of word frequency as the goal sub-string in each said sub-string specifically.
B12, the apparatus of B10, the query sub-module comprising:
the word frequency calculation unit is used for calculating the word frequency of each collision mode string if the collision mode string exists in the target sub-string;
and the query comparison unit is used for sequentially querying whether the text string hits the collision mode string according to the position of the matching string in the text string and the sequence of the word frequency of the collision mode string from large to small.
B13, if the number n of the pattern strings to be matched is greater than 1, the allocation module includes:
a grouping determination submodule, configured to determine the number m of groupings of the bit vector table;
and the mode string allocation submodule is used for allocating the n coded mode strings to a bit vector table containing m groups.
B14, the apparatus of B13, the grouping assignment submodule, including:
a first calculation unit, configured to determine a misjudgment string set generated by allocating the ith coded pattern string to the jth packet; wherein, the value of i is 1-n, and the value of j is 1-m;
the second calculation unit is used for determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and the mode string allocation unit is used for allocating the ith coded mode string to the packet with the minimum first loss gain until the nth coded mode string allocation is completed.
The embodiment of the invention discloses a C15, a device for data processing, which comprises a memory and one or more programs, wherein the one or more programs are stored in the memory, and are configured to be executed by one or more processors, and the one or more programs comprise instructions for:
Performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings;
encoding each word in the first word segmentation set to obtain an encoded pattern string;
distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
and matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table.
C16, the device of C15, the device further configured to be executed by one or more processors, the one or more programs comprising instructions for:
word segmentation is carried out on the text strings, and a second word segmentation set corresponding to the text strings is obtained;
encoding each word in the second word segmentation set to obtain an encoded text string;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
and matching the coded pattern string with the coded text string based on a bit parallel algorithm according to the bit vector table.
C17, the device of 15, the device further configured to be executed by one or more processors the one or more programs comprising instructions for:
dividing the coded mode string according to the grouping length when the length of the coded mode string is larger than the preset grouping length, so as to obtain each sub-string after division;
determining a target substring in the substrings;
the bit vector table corresponding to the coded pattern string is allocated, which comprises the following steps:
distributing a bit vector table corresponding to the target substring;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring, so as to obtain the position of the matching string in the text string;
and inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
C18, the apparatus of C17, wherein determining the target substring in the respective substrings includes:
And determining the substring with the minimum word frequency as a target substring in the substrings.
C19, the device according to C17, according to the position of the matching string in the text string, inquires whether the text string hits the pattern string corresponding to the target sub-string, including:
if the target substring has a collision mode string, respectively calculating word frequency of each collision mode string;
and according to the position of the matching string in the text string, sequentially inquiring whether the text string hits the collision mode string according to the sequence of the word frequency of the collision mode string from large to small.
C20, according to the apparatus of C15, if the number n of the pattern strings to be matched is greater than 1, the allocating the bit vector table corresponding to the encoded pattern string includes:
determining the grouping number m of the bit vector table;
the n encoded pattern strings are assigned to a bit vector table containing m packets.
C21, the apparatus of C20, the allocating the n encoded pattern strings into a bit vector table comprising m packets, comprising:
determining a misjudgment character string set generated by distributing the ith coded pattern string to the jth grouping; wherein, the value of i is 1-n, and the value of j is 1-m;
Determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and distributing the ith coded pattern string to the packet with the minimum loss gain until the nth coded pattern string is distributed.
Embodiments of the invention disclose D22, a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a pattern matching method as described in one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
The above description of a pattern matching method, a pattern matching device and a device for pattern matching provided by the present invention has described specific examples, which are used to illustrate the principles and embodiments of the present invention, and the above examples are only used to help understand the method and core ideas of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.
Claims (19)
1. A pattern matching method, the method comprising:
performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings;
encoding each word in the first word segmentation set to obtain an encoded pattern string;
when the length of the coded mode string is larger than a preset grouping length, dividing the coded mode string according to the grouping length to obtain divided sub-strings;
Determining a target substring in the substrings;
distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
according to the bit vector table, matching the pattern string with the text string to be matched based on a bit parallel algorithm;
the bit vector table corresponding to the coded pattern string is allocated, which comprises the following steps:
distributing a bit vector table corresponding to the target substring;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring, so as to obtain the position of the matching string in the text string;
and inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
2. The method of claim 1, wherein before matching the pattern string with the text string to be matched based on a bit-parallel algorithm according to the bit vector table, the method further comprises:
Word segmentation is carried out on the text strings, and a second word segmentation set corresponding to the text strings is obtained;
encoding each word in the second word segmentation set to obtain an encoded text string;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
and matching the coded pattern string with the coded text string based on a bit parallel algorithm according to the bit vector table.
3. The method of claim 1, wherein said determining a target substring among said respective substrings comprises:
and determining the substring with the minimum word frequency as a target substring in the substrings.
4. The method of claim 1, wherein the querying whether the text string hits the pattern string corresponding to the target substring based on the location of the matching string in the text string comprises:
if the target substring has a collision mode string, respectively calculating word frequency of each collision mode string;
and according to the position of the matching string in the text string, sequentially inquiring whether the text string hits the collision mode string according to the sequence of the word frequency of the collision mode string from large to small.
5. The method according to claim 1, wherein if the number n of pattern strings to be matched is greater than 1, the allocating the bit vector table corresponding to the encoded pattern strings includes:
determining the grouping number m of the bit vector table;
the n encoded pattern strings are assigned to a bit vector table containing m packets.
6. The method of claim 5, wherein said assigning n encoded pattern strings into a bit vector table containing m packets comprises:
determining a misjudgment character string set generated by distributing the ith coded pattern string to the jth grouping; wherein, the value of i is 1-n, and the value of j is 1-m;
determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and distributing the ith coded pattern string to the packet with the minimum loss gain until the nth coded pattern string is distributed.
7. A pattern matching device, the device comprising:
the first word segmentation module is used for segmenting the mode strings to be matched to obtain a first word segmentation set corresponding to the mode strings;
The first coding module is used for coding each word in the first word segmentation set to obtain a coded mode string;
the allocation module is used for allocating a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
the pattern matching module is used for matching the pattern string with the text string to be matched based on a bit parallel algorithm according to the bit vector table;
the sub-string dividing module is used for dividing the coded mode string according to the grouping length when the length of the coded mode string is larger than the preset grouping length, so as to obtain each divided sub-string;
the target determining module is used for determining target substrings in the substrings;
the allocation module is specifically configured to allocate a bit vector table corresponding to the target substring;
the pattern matching module comprises:
the matching sub-module is used for matching the target sub-string with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target sub-string to obtain the position of the matching string in the text string;
and the inquiring sub-module is used for inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the second word segmentation module is used for segmenting the text string to obtain a second word segmentation set corresponding to the text string;
the second coding module is used for coding each word in the second word segmentation set to obtain a coded text string;
the pattern matching module is specifically configured to match the encoded pattern string with the encoded text string based on a bit parallel algorithm according to the bit vector table.
9. The apparatus of claim 7, wherein the target determining module is specifically configured to determine, among the sub-strings, a sub-string with a smallest word frequency as a target sub-string.
10. The apparatus of claim 7, wherein the query sub-module comprises:
the word frequency calculation unit is used for calculating the word frequency of each collision mode string if the collision mode string exists in the target sub-string;
and the query comparison unit is used for sequentially querying whether the text string hits the collision mode string according to the position of the matching string in the text string and the sequence of the word frequency of the collision mode string from large to small.
11. The apparatus of claim 7, wherein if the number of pattern strings to be matched n is greater than 1, the allocation module comprises:
a grouping determination submodule, configured to determine the number m of groupings of the bit vector table;
and the mode string allocation submodule is used for allocating the n coded mode strings to a bit vector table containing m groups.
12. The apparatus of claim 11, wherein the pattern string assignment submodule comprises:
a first calculation unit, configured to determine a misjudgment string set generated by allocating the ith coded pattern string to the jth packet; wherein, the value of i is 1-n, and the value of j is 1-m;
the second calculation unit is used for determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and the mode string allocation unit is used for allocating the ith coded mode string to the packet with the minimum first loss gain until the nth coded mode string allocation is completed.
13. An apparatus for data processing comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
Performing word segmentation on the pattern strings to be matched to obtain a first word segmentation set corresponding to the pattern strings;
encoding each word in the first word segmentation set to obtain an encoded pattern string;
distributing a bit vector table corresponding to the coded pattern string, wherein bits in the bit vector table are used for indicating whether corresponding word segmentation occurs in the pattern string;
according to the bit vector table, matching the pattern string with the text string to be matched based on a bit parallel algorithm;
the device is also configured to be executed by one or more processors the one or more programs including instructions for:
when the length of the coded mode string is larger than a preset grouping length, dividing the coded mode string according to the grouping length to obtain divided sub-strings;
determining a target substring in the substrings;
the bit vector table corresponding to the coded pattern string is allocated, which comprises the following steps:
distributing a bit vector table corresponding to the target substring;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
Matching the target substring with the text string based on a bit parallel algorithm according to a bit vector table corresponding to the target substring, so as to obtain the position of the matching string in the text string;
and inquiring whether the text string hits the mode string corresponding to the target sub-string according to the position of the matching string in the text string.
14. The device of claim 13, wherein the device is further configured to be executed by one or more processors the one or more programs include instructions for:
word segmentation is carried out on the text strings, and a second word segmentation set corresponding to the text strings is obtained;
encoding each word in the second word segmentation set to obtain an encoded text string;
the matching of the pattern string and the text string to be matched based on a bit parallel algorithm according to the bit vector table comprises the following steps:
and matching the coded pattern string with the coded text string based on a bit parallel algorithm according to the bit vector table.
15. The apparatus of claim 13, wherein said determining a target substring among said respective substrings comprises:
And determining the substring with the minimum word frequency as a target substring in the substrings.
16. The apparatus of claim 13, wherein the querying whether the text string hits the pattern string corresponding to the target substring based on the location of the matching string in the text string comprises:
if the target substring has a collision mode string, respectively calculating word frequency of each collision mode string;
and according to the position of the matching string in the text string, sequentially inquiring whether the text string hits the collision mode string according to the sequence of the word frequency of the collision mode string from large to small.
17. The apparatus of claim 13, wherein if the number n of pattern strings to be matched is greater than 1, the allocating the bit vector table corresponding to the encoded pattern string comprises:
determining the grouping number m of the bit vector table;
the n encoded pattern strings are assigned to a bit vector table containing m packets.
18. The apparatus of claim 17, wherein said allocating n encoded pattern strings into a bit vector table comprising m packets comprises:
determining a misjudgment character string set generated by distributing the ith coded pattern string to the jth grouping; wherein, the value of i is 1-n, and the value of j is 1-m;
Determining a first loss gain generated by distributing the ith coded mode string to the jth grouping according to the word frequency of each misjudgment character string in the misjudgment character string set;
and distributing the ith coded pattern string to the packet with the minimum loss gain until the nth coded pattern string is distributed.
19. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the pattern matching method of one or more of claims 1 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010183402.5A CN111400563B (en) | 2020-03-16 | 2020-03-16 | Pattern matching method and device for pattern matching |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010183402.5A CN111400563B (en) | 2020-03-16 | 2020-03-16 | Pattern matching method and device for pattern matching |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111400563A CN111400563A (en) | 2020-07-10 |
| CN111400563B true CN111400563B (en) | 2023-08-01 |
Family
ID=71428920
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010183402.5A Active CN111400563B (en) | 2020-03-16 | 2020-03-16 | Pattern matching method and device for pattern matching |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111400563B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112131866B (en) * | 2020-09-25 | 2024-06-14 | 马上消费金融股份有限公司 | Word segmentation method, device, equipment and readable storage medium |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110276071A (en) * | 2019-05-24 | 2019-09-24 | 众安在线财产保险股份有限公司 | A kind of text matching technique, device, computer equipment and storage medium |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TWI525606B (en) * | 2014-06-05 | 2016-03-11 | 和碩聯合科技股份有限公司 | Information providing method, system and string providing system |
-
2020
- 2020-03-16 CN CN202010183402.5A patent/CN111400563B/en active Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110276071A (en) * | 2019-05-24 | 2019-09-24 | 众安在线财产保险股份有限公司 | A kind of text matching technique, device, computer equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111400563A (en) | 2020-07-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107608532B (en) | Association input method and device and electronic equipment | |
| CN107340880B (en) | Association input method and device and electronic equipment for realizing association input | |
| CN112861175B (en) | Data processing method and device for data processing | |
| CN107564526B (en) | Processing method, apparatus and machine-readable medium | |
| CN114168809B (en) | Document character string coding matching method and device based on similarity | |
| CN113268179B (en) | Session message processing method, device, equipment and storage medium | |
| CN111400563B (en) | Pattern matching method and device for pattern matching | |
| CN114168808B (en) | Document character string coding identification method and device based on regular expression | |
| CN106959970B (en) | Word bank, processing method and device of word bank and device for processing word bank | |
| CN108628883B (en) | Data processing method and device and electronic equipment | |
| CN108573697B (en) | Language model updating method, device and equipment | |
| CN112987941B (en) | Method and device for generating candidate words | |
| CN111103986B (en) | User word stock management method and device, and user word stock input method and device | |
| CN115794809B (en) | Resource data retrieval method, device, electronic device and storage medium | |
| CN114168807B (en) | String matching method and device | |
| CN110780749B (en) | Character string error correction method and device | |
| CN109992790B (en) | Data processing method and device for data processing | |
| CN114298227B (en) | Text deduplication method, device, equipment and medium | |
| CN111382325B (en) | Pattern string allocation method and device for pattern string allocation | |
| CN110110292B (en) | Data processing method and device for data processing | |
| CN113312475B (en) | Text similarity determination method and device | |
| CN110019657B (en) | Processing method, apparatus and machine-readable medium | |
| CN108073566B (en) | Word segmentation method and device and word segmentation device | |
| CN112651221A (en) | Data processing method and device and data processing device | |
| CN111460836B (en) | Data processing method and device for data processing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |