CN107291761A - The matching process and device of a kind of regular expression - Google Patents
The matching process and device of a kind of regular expression Download PDFInfo
- Publication number
- CN107291761A CN107291761A CN201610206194.XA CN201610206194A CN107291761A CN 107291761 A CN107291761 A CN 107291761A CN 201610206194 A CN201610206194 A CN 201610206194A CN 107291761 A CN107291761 A CN 107291761A
- Authority
- CN
- China
- Prior art keywords
- regular expression
- sub
- matched
- text
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a kind of matching process of regular expression and device, methods described is:A sub- regular expression is partitioned into from the regular expression for text matches, pending sub- regular expression is used as;Following processing is performed for pending sub- regular expression:Matched using pending sub- regular expression with text to be matched;If matching, matching result is obtained, matching terminates;If not matching, then continue to remove from regular expression in the remainder after the pending sub- regular expression and be partitioned into a sub- regular expression, it is used as new pending sub- regular expression, returned for new pending sub- regular expression and continue executing with the processing, untill it can not be partitioned into new pending sub- regular expression again from regular expression, then matching terminates.This method by splitting regular expression, with for former regular expression better simply sub- regular expression order go matched text, matching speed can be improved.
Description
Technical field
The present invention relates to the matching process and device of data processing field, more particularly to a kind of regular expression.
Background technology
Regular expression can use the complicated data characteristics of simple syntactic description, therefore be widely used in net
The multiple fields such as network intrusion detection, document content retrieval.
The pattern match of regular expression mainly can be by deterministic finite automaton (referred to as:DFA) and
Non-deterministic finite automaton is (referred to as:NFA) two kinds of data structures are realized.
But the matching of existing regular expression all has respective defect, for example, DFA matching speeds
It hurry up, but EMS memory occupation is too high, for many complicated regular expression rules or large-scale regular expressions
Formula regular collection, DFA can generating state blast.In another example, NFA EMS memory occupations are small, but matching speed
It is extremely slow, actual network processes requirement can not be met at all on multinuclear or general purpose processor platform.
The content of the invention
The application provides a kind of matching process and device of regular expression, for solving existing regular expressions
Either the matching process of formula can produce substantial amounts of intermediateness, or the problem of matching speed is slow.
The application first aspect there is provided a kind of matching process of regular expression, including:
A sub- regular expression is partitioned into from the regular expression for text matches, pending son is used as
Regular expression;
Following processing is performed for the pending sub- regular expression:
Matched using the pending sub- regular expression with text to be matched;
If matching, matching result is obtained, matching terminates;
If not matching, continue to remove from the regular expression after the pending sub- regular expression
Remainder in be partitioned into a sub- regular expression, as new pending sub- regular expression, for
New pending sub- regular expression, which is returned, continues executing with the processing, until from the regular expression not
Untill new pending sub- regular expression can be partitioned into again, then matching terminates.
In a possible design, a sub- canonical is partitioned into from the regular expression for text matches
Expression formula, including:
Since for the original position of the regular expression of text matches, from front to back search first not by
Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag
Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions
First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits
Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
In a possible design, the pending sub- regular expression and text to be matched progress are used
Match somebody with somebody, including:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression
And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, after the publicly-owned prefix is removed from the pending sub- regular expression and be publicly-owned
Sew, obtain remainder, obtained remainder is split into at least one sub- regular expression, and will tear open
The every sub- regular expression got is matched with text to be matched successively, until one of them splits
To the success of sub- regular expression and text matches to be matched untill, or until split every obtained sub- canonical
Untill expression formula fails with text matches to be matched.
As can be seen here, the matching process that the application is provided is the matching process that quickly fails, in publicly-owned prefix and
/ or the failure of publicly-owned suffix match in the case of, it is not necessary to the remainder of pending sub- regular expression is allocated as
Match somebody with somebody, can so accelerate matching speed, improve matching efficiency.
In a possible design, carried out using the pending sub- regular expression with text to be matched
Matching, including:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described
Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
In a possible design, carried out using the pending sub- regular expression with text to be matched
Matching, including:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression
Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into at least one sub- regular expressions
Formula;
The every sub- regular expression split into is matched with text to be matched successively, until one of them
Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each
Untill sub- regular expression fails with text matches to be matched.
The application second aspect there is provided a kind of coalignment of regular expression, including:
Cutting unit, for being partitioned into a sub- regular expressions from the regular expression for text matches
Formula, is used as pending sub- regular expression;
Processing unit, for performing following processing for the pending sub- regular expression:Treated using described
Sub- regular expression is handled to be matched with text to be matched;If matching, matching result is obtained, is matched
Terminate;If not matching, continue to treat described in removing from the regular expression by the cutting unit
Handle and a sub- regular expression is partitioned into the remainder after sub- regular expression, as new pending
Sub- regular expression;Returned for new pending sub- regular expression and continue executing with the processing, until institute
Untill stating cutting unit and can not being partitioned into new pending sub- regular expression again from the regular expression,
Then matching terminates.
In a possible design, the cutting unit is dividing from the regular expression for text matches
When cutting out a sub- regular expression, specifically for:
Since for the original position of the regular expression of text matches, from front to back search first not by
Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag
Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions
First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits
Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
In a possible design, the processing unit using the pending sub- regular expression with treating
When matched text is matched, specifically for:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression
And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, the public affairs are removed from the pending sub- regular expression by the cutting unit
There are prefix and publicly-owned suffix, obtain remainder, obtained remainder is split into at least one sub- canonical
Expression formula, and every sub- regular expression that fractionation is obtained is matched with text to be matched successively, until
Untill one of them splits obtained sub- regular expression and text matches to be matched success, or until split
To every sub- regular expression with text matches to be matched failure untill.
In a possible design, the processing unit using the pending sub- regular expression with treating
When the text of matching is matched, specifically for:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described
Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
In a possible design, the processing unit using the pending sub- regular expression with treating
When the text of matching is matched, specifically for:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression
Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into by least one by cutting unit
Sub- regular expression;
The every sub- regular expression split into is matched with text to be matched successively, until one of them
Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each
Untill sub- regular expression fails with text matches to be matched.
The scheme provided using the application, the application provide regular expression matching process be one kind with just
It is then the method being oriented to, therefore different from using text as the DFA being oriented to, this method will not produce intermediateness and lead
State explosion is caused, also, this method is by splitting regular expression, with for former regular expression
Better simply sub- regular expression order goes matched text, so as to improve matching speed.
Brief description of the drawings
A kind of flow chart of the matching process for regular expression that Fig. 1 provides for the application;
A kind of structural representation of the coalignment for regular expression that Fig. 2 provides for the application.
Embodiment
Hereinafter, the part term in the application is explained.
" regular expression ", in many text editors or other instruments, is usually used to retrieval and/or replaces
Change those content of text for meeting some pattern.Regular expression is made up of general character and metacharacter.Commonly
Character includes the letter of numeral and capital and small letter, and metacharacter is the character with special implication, including as follows
11 alphabetic characters:[]\^$.|*+().Metacharacter is used for specific use, for example, " " is used to match
Except line feed metacharacter " n " and " any character in addition to r ";The subexpression that " * " is used to match above is arbitrarily secondary;
“" represent matching 0 or 1 that character just before it, when the character immediately any one its
His delimiter (* ,+,, { n }, { n, }, { n, m }) below when, match pattern is non-greedy, non-greediness
The character string that pattern few matching as far as possible is searched for, and the greedy pattern given tacit consent to then matching institute as much as possible
The character string of search;" | " represents two matching conditions carrying out logical "or" (English:Or) computing;" () ",
By (and) between expression formula be defined as " group " (English:group).
" the sub- regular expression " of some regular expression refers to the part being partitioned into from the regular expression, this
Metacharacter "or" in application not included by metacharacter " group ", i.e. " | " " () " outside are partitioning standards.
For example, it is assumed that a regular expression is (A (B | C)) | D, then can divide and obtain (A (B | C)) and two sons of D are just
Then expression formula.It should be noted that when the outermost for dividing obtained sub- regular expression has " () ",
Continuing to remove outmost " () " into the sub- regular expression before processing, it is so-called outmost to include
Number, will bracket that entirely sub- regular expression is included, for example, above-mentioned to continue with
(A (B | C)), then first to delete outmost " () ", obtain A (B | C);If but obtained sub- regular expressions
The bracket of formula is not in the outermost of the sub- regular expression, such as (DE (F | G) H) | MLN, then without
Remove the bracket in the sub- regular expression.The operation of the sub- outmost bracket of regular expression will hereafter be removed
It is considered default action, no longer specially refers to.
" publicly-owned prefix " and " publicly-owned suffix ", if a certain (son) regular expression does not include not by metacharacter " group "
Comprising metacharacter "or" when, should the part on " group " outer left side of (son) regular expression be referred to as publicly-owned prefix,
Should (son) regular expression " group " outside on the right of part be referred to as publicly-owned suffix, for example, it is assumed that a canonical table
It is A (B | C) D up to formula, then the publicly-owned prefix of the regular expression is A, publicly-owned suffix is D.
" non-determined type expression formula ", refer to (son) regular expressions that can not be matched according to general character string
Include the metacharacter of particular meaning in formula, general non-determined type expression formula, such as d, w .*, [d-f] { 2,3 },
[1-3] * etc..
Corresponding, " deterministic type expression formula " is (son) for referring to be matched according to general character string
Metacharacter not comprising particular meaning in regular expression, general deterministic type expression formula, such as abc, 1234,
A12 etc..
Substantial amounts of intermediateness can be produced for DFA, state explosion is caused, and NFA matching speeds are slow
The problem of, present applicant proposes a kind of matching process of regular expression and device, this method and NFA classes
Seemingly, it is a kind of using canonical as the method being oriented to, therefore different from using text as the DFA being oriented to, this method will not
Producing intermediateness causes state explosion, also, this method is by splitting regular expression, with compared to original
Better simply sub- regular expression order goes matched text for regular expression, can improve matching speed.
What the matching process for the regular expression that the application is provided went for that traditional NFA is applicable owns
Program and language, such as Java, GNU Emacs, ergp, less, more .NET etc..
The preferred embodiment of the present invention is described in detail below in conjunction with the accompanying drawings.
A kind of flow chart of the matching process of the regular expression provided as shown in Figure 1 for the application, the side
Method includes:
Step 101:A sub- regular expression is partitioned into from the regular expression for text matches, is made
For pending sub- regular expression.
Optionally, the process of segmentation regular expression is as follows:
First, since for the original position of the regular expression of text matches, first is searched from front to back
The position of individual do not included by constituent element character or metacharacter;Or, can also be from the knot of the regular expression
Beam position starts, and first position do not included by constituent element character or metacharacter is searched from back to front.Below
For concise explanation, by do not included by the constituent element character or metacharacter, i.e., " | " symbol outside " () "
It is known as " group outer or " metacharacter.
Secondly, can be by the original position of the regular expression to first " group outer or " if finding
Part between the position of metacharacter is used as a sub- regular expression;Or, can also be by the canonical table
Up to the position to the portion between the end position of the regular expression of first " group outer or " metacharacter of formula
It is allocated as a sub- regular expression;Or, can also be searched in the regular expression first
The wherein shorter part of symbol lengths is selected as one in two parts that " group outer or " metacharacter is divided into
Sub- regular expression.
If not finding, the regular expression is regard as a sub- regular expression.
As an example it is assumed that a regular expression is ABC | (DE (F | G (O | P) H)) | MLN, if
Search first " group outer or " metacharacter, then first from front to back from the starting position of the regular expression
" group outer or " metacharacter is located at behind " ABC ", can select ABC or (DE (F | G (O | P) H))
| the sub- regular expression that MLN is partitioned into as first;If from the end position of the regular expression from
Search first " group outer or " metacharacter forward afterwards, then first " group outer or " metacharacter is located at before " MLN "
Face, can select ABC | (DE (F | G (O | P) H)) or MLN as first son being partitioned into just
Then expression formula.
In another example, it is assumed that a regular expression is DE (F | G), because the regular expression is not present, " group is outer
Or " metacharacter, then a sub- regular expression can be used as using DE (F | G) is overall.
Step 102:Following processing is performed for the pending sub- regular expression:Using described pending
Sub- regular expression is matched with text to be matched;If matching, step 103 is performed, if matching is not
Arrive, then perform step 104.
Step 103:Matching result is obtained, matching terminates.
Step 104:Continue to remove from the regular expression surplus after the pending sub- regular expression
A sub- regular expression is partitioned into remaining part point, as new pending sub- regular expression, for new
Pending sub- regular expression return to step 102 continues executing with the processing, until from the regular expression
In can not be partitioned into new pending sub- regular expression again untill, then matching terminate.
Optionally, when performing step 102, the form of pending sub- regular expression is different, corresponding matching
Mode can also be different, can specifically there is following three kinds:
The first matching way, it is adaptable to include or metacharacter and/or constituent element character and comprising publicly-owned prefix and
/ or publicly-owned suffix pending sub- regular expression, such as DE (F | G (O | P) H.
Second of matching way, it is adaptable to include or metacharacter and/or constituent element character but not comprising publicly-owned prefix
And/or the pending sub- regular expression of publicly-owned suffix, such as F | G (O | P).
The third matching way, it is adaptable to do not include or metacharacter or the pending sub- regular expressions of constituent element character
Formula, such as ABC.
Group regular expression is contained or when metacharacter and/or constituent element character, represents that the sub- regular expression can
Continue to be divided into the sub- regular expression of smaller particle size.Group regular expression does not include or metacharacter or constituent element
During character, it has been indivisible minimum unit to represent the sub- regular expression.
These three matching ways are introduced separately below.
The handling process of the first matching way is as follows;
Step one, included or metacharacter and/or constituent element character and comprising publicly-owned in pending sub- regular expression
When prefix and/or publicly-owned suffix, the publicly-owned prefix and/or publicly-owned suffix and text to be matched progress are used
Match somebody with somebody.
Optionally, if the publicly-owned prefix that pending sub- regular expression is included is non-determined type expression formula,
The publicly-owned prefix can be converted into " .*" or " .* ", that is, any value can be matched by giving tacit consent to the publicly-owned prefix, temporarily
When the publicly-owned prefix is not matched.For example, it is assumed that a regular expression for d (d) abc, before its is publicly-owned
Sew d be non-determined type expression formula, therefore the publicly-owned prefix can be converted to " .*", the canonical table after conversion
It is .* up to formula(d) abc, and continue with conversion after regular expression.
Similarly, can if the publicly-owned suffix that pending sub- regular expression is included is non-determined type expression formula
So that the publicly-owned suffix is converted into " .*" or " .* ", that is, any value can be matched by giving tacit consent to the publicly-owned suffix, temporarily
The publicly-owned suffix is not matched.
“.*" and " .* " the two metacharacter strings may be incorporated for match any value.Difference is that " .* " is
Greedy pattern, " .*" be inadequate pattern, for example, text to be matched be 124deffg, if use " .*f " this
Regular expression goes matched text, will match to 124deff, if using " .*" this regular expression goes
With text, 124def will match to.
“.*" advantage be that backtracking will not be formed, the advantage of " .* ", which is to match, all meets canonical
The symbol of expression formula.
Step 2, if matching is less than confirming the pending sub- regular expression and text matches to be matched
Failure.
As can be seen here, the matching process that the application is provided is the matching process that quickly fails, in publicly-owned prefix and
/ or the failure of publicly-owned suffix match in the case of, it is not necessary to the remainder of pending sub- regular expression is allocated as
Match somebody with somebody, can so accelerate matching speed, improve matching efficiency.
Step 3, if matching, removed from the pending sub- regular expression the publicly-owned prefix and
Publicly-owned suffix, obtains remainder, and obtained remainder is split into at least one sub- regular expression,
And matched every sub- regular expression that fractionation is obtained with text to be matched successively, until one of them
Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each
Untill sub- regular expression fails with text matches to be matched.
When every sub- regular expression for obtaining fractionation is matched with text to be matched successively, Ke Yigen
The form of the every sub- regular expression obtained according to splitting, selects suitable matching way to be carried out to it respectively
Match somebody with somebody.
Optionally, the matching result that pending sub- regular expression can be obtained in step 3, as tearing open
The text to be matched for the sub- regular expression got.
For example, it is assumed that regular expression is DE (F | G (O | P)) H, its publicly-owned prefix DE is used
The coordinate range obtained with publicly-owned suffix H matched texts is [50,100], then subsequent match F | G (O | P)
When only need to the text [50,100] coordinate in match, it is not necessary to whole text is matched again.
Unmatched range of text can be so filtered out, has cut down a large amount of times produced in whole matching process
Trace back, improve matching efficiency.
The handling process of second of matching way is as follows;
Step one, include or metacharacter and/or constituent element character and do not wrap in the pending sub- regular expression
Containing publicly-owned prefix and/or during publicly-owned suffix, the pending sub- regular expression is being split into at least one son just
Then expression formula.
Step 2, the every sub- regular expression split into is matched with text to be matched successively, until
Untill one of them splits obtained sub- regular expression and text matches to be matched success, or until split
To every sub- regular expression with text matches to be matched failure untill.
, can basis when every split into sub- regular expression is matched with text to be matched successively
The form of obtained every sub- regular expression is split, selects suitable matching way to be carried out to it respectively
Match somebody with somebody.
Optionally, can be by the matching result of the sub- regular expression before fractionation, as from the sub- regular expressions
The text to be matched for the sub- regular expression that formula is split out.
The handling process of the third matching way is as follows;
Step one, do not include or when metacharacter or constituent element character in the pending sub- regular expression, use
The pending sub- regular expression is matched with text to be matched.
Step 2, if matching is less than confirming the pending sub- regular expression and text matches to be matched
Failure.
Step 3, if matching, confirm the pending sub- regular expression and text matches to be matched into
Work(.
In order to illustrate more clearly of the technical scheme of the application, above-mentioned flow is done below by one embodiment
Further instruction is, it is necessary to which explanation, this embodiment is only one embodiment of the application, is not constituted
Restriction to the application.
Assuming that text String text to be matched are:
String text=" zhe shi yi ge ce shi de www://cdn.voole.com/a=3&bc=d&cd=kk
&de=9923&http://125.39.27.10Ca=appid "
+ " &bofangchuan=ddd "
+"jav orcale php mysql ks tea aliasmethod 3des rsa md5"
+"vlive://voole.comCa=kksjmds ";
Regular expression String str for text matches are:
String str=" (http | vlive)://|(\\d(\\.\\d)*|\\w(\\.\\w)*)\\(ca=(appid (d { 3,5 } | &bof
Angchuan=([3-9] { 4,8 } | [d-f] { 2,3 })) | aid)) ";
Processing procedure is as follows:
The first step, for above-mentioned regular expression, first since beginning, " group is outer for interception first
Or " metasymbol carry out regular expression segmentation, obtain first pending sub- regular expressions
Formula:(http|vlive)://.
Second step, interception (http | vlive):// publicly-owned prefix and publicly-owned suffix.
It can be seen that (http | vlive):// without publicly-owned prefix, publicly-owned suffix is://.
3rd step, is determined according to BM algorithms:// appear in position in whole text.
It can be seen that:// occurred in that altogether in text to be matched three times, it is respectively:
For the first time:de www://;Matching coordinate range is:0-27.
Second:De=9923&http://;Matching coordinate range is:28-71.
For the third time:vlive:The vlive of // last column://;Matching coordinate range is:72-168.
4th step, remove (http | vlive):// in publicly-owned suffix, obtain (http | vlive), then remove outermost
The bracket in face, obtains http | vlive.
5th step, because http | vlive includes " | ", therefore it is alienable, by http | and vlive further divides
It is cut into http and vlive two parts.
6th step, determines that http appears in the position in the matching coordinate range that the 3rd step is obtained according to BM algorithms
Put, because http is to determine type expression formula, therefore without http is converted into " .*" or " .* ".
As can be seen that http is appeared in second matching coordinate range 28-71, http specifically matches seat
Marking scope is:67-71.
7th step, it is no subsequently because current sub- regular expression http has been indivisible unit
Continue the sub- regular expression matched, and this sub- regular expression http has been met, therefore matching can be returned
Coordinate range 67-71, and terminate matching process.
The matching process of regular expression based on the above-mentioned offer of the application, the application provides a kind of regular expressions
The coalignment 200 of formula, as shown in Fig. 2 described device 200 includes cutting unit 201 and processing unit
202:
The cutting unit 201, for being partitioned into a son from the regular expression for text matches just
Then expression formula, is used as pending sub- regular expression.
The processing unit 202, for performing following processing for the pending sub- regular expression:Make
Matched with the pending sub- regular expression with text to be matched;If matching, matching knot is obtained
Really, matching terminates;If not matching, continued by the cutting unit 201 from the regular expression
A sub- regular expression is partitioned into the middle remainder removed after the pending sub- regular expression, is made
For new pending sub- regular expression;Returned for new pending sub- regular expression described in continuing executing with
Processing, until the cutting unit 201 can not be partitioned into new pending son again from the regular expression
Untill regular expression, then matching terminates.
Optionally, the cutting unit 201 is being partitioned into one from the regular expression for text matches
During sub- regular expression, specifically for:Since for the original position of the regular expression of text matches,
First position do not included by constituent element character or metacharacter is searched from front to back;If finding, by institute
The original position of regular expression is stated to described first position do not included by constituent element character or metacharacter
Between part be used as a sub- regular expression;Or, by first of the regular expression not by constituent element
Position that character is included or metacharacter is used as one to the part between the end position of the regular expression
Individual sub- regular expression;If not finding, the regular expression is regard as a sub- regular expression.
Optionally, the processing unit 202 is using the pending sub- regular expression and text to be matched
When being matched, specifically for:Included or metacharacter and/or constituent element word in the pending sub- regular expression
Symbol and comprising publicly-owned prefix and/or during publicly-owned suffix, using the publicly-owned prefix and/or publicly-owned suffix with treating
Matched text is matched;If matching is less than confirming the current sub- regular expression and text to be matched
It fails to match;If matching, by the cutting unit 201 from the pending sub- regular expression
The publicly-owned prefix and publicly-owned suffix are removed, remainder is obtained, obtained remainder is split at least
One sub- regular expression, and every sub- regular expression that fractionation is obtained is carried out with text to be matched successively
Matching, untill the sub- regular expression that one of fractionation is obtained with text matches to be matched success, or
Untill every obtained sub- regular expression is split with text matches to be matched failure.
Optionally, the processing unit 202 is using the pending sub- regular expression and text to be matched
When this progress is matched, specifically for:Do not include or metacharacter or constituent element in the pending sub- regular expression
During character, matched using the pending sub- regular expression with text to be matched;If matching less than,
Then confirm that the pending sub- regular expression fails with text matches to be matched;If matching, institute is confirmed
State pending sub- regular expression and text matches to be matched success.
Optionally, the processing unit 202 is using the pending sub- regular expression and text to be matched
When this progress is matched, specifically for:Included or metacharacter and/or constituent element in the pending sub- regular expression
Character and during not comprising publicly-owned prefix and/or publicly-owned suffix, by cutting unit 201 by the pending son
Regular expression splits at least one sub- regular expression;By the every sub- regular expression split into successively
Matched with text to be matched, until sub- regular expression and text to be matched that one of fractionation is obtained
Untill the match is successful, or until every sub- regular expression that fractionation is obtained fails with text matches to be matched
Untill.
In summary, the technical scheme that the application is provided is a kind of using canonical as the method being oriented to, therefore is different from
Using text as the DFA being oriented to, this method, which will not produce intermediateness, causes state explosion, also, this method
By splitting regular expression, with the better simply sub- regular expression order for former regular expression
Matched text is gone, matching speed can be improved.In addition, the technical scheme that the application is provided can be by before fractionation
Sub- regular expression matching result, be used as the sub- regular expression split out from the sub- regular expression
Text to be matched, can so filter out unmatched range of text, cut down in whole matching process and produced
A large amount of backtrackings, improve matching efficiency.
It should be understood by those skilled in the art that, embodiments of the invention can be provided as method, system or meter
Calculation machine program product.Therefore, the present invention can be using complete hardware embodiment, complete software embodiment or knot
The form of embodiment in terms of conjunction software and hardware.Wherein wrapped one or more moreover, the present invention can be used
Containing computer usable program code computer-usable storage medium (include but is not limited to magnetic disk storage,
CD-ROM, optical memory etc.) on the form of computer program product implemented.
The present invention is with reference to the production of method according to embodiments of the present invention, equipment (system) and computer program
The flow chart and/or block diagram of product is described.It should be understood that can by computer program instructions implementation process figure and
/ or each flow and/or square frame in block diagram and the flow in flow chart and/or block diagram and/
Or the combination of square frame.These computer program instructions can be provided to all-purpose computer, special-purpose computer, insertion
Formula processor or the processor of other programmable data processing devices are to produce a machine so that pass through and calculate
The instruction of the computing device of machine or other programmable data processing devices is produced for realizing in flow chart one
The device for the function of being specified in individual flow or multiple flows and/or one square frame of block diagram or multiple square frames.
, but those skilled in the art once know base although preferred embodiments of the present invention have been described
This creative concept, then can make other change and modification to these embodiments.So, appended right will
Ask and be intended to be construed to include preferred embodiment and fall into having altered and changing for the scope of the invention.
Claims (10)
1. a kind of matching process of regular expression, it is characterised in that including:
A sub- regular expression is partitioned into from the regular expression for text matches, pending son is used as
Regular expression;
Following processing is performed for the pending sub- regular expression:
Matched using the pending sub- regular expression with text to be matched;
If matching, matching result is obtained, matching terminates;
If not matching, continue to remove from the regular expression after the pending sub- regular expression
Remainder in be partitioned into a sub- regular expression, as new pending sub- regular expression, for
New pending sub- regular expression, which is returned, continues executing with the processing, until from the regular expression not
Untill new pending sub- regular expression can be partitioned into again, then matching terminates.
2. the method as described in claim 1, it is characterised in that from the regular expressions for text matches
A sub- regular expression is partitioned into formula, including:
Since for the original position of the regular expression of text matches, from front to back search first not by
Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag
Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions
First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits
Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
3. the method as described in claim 1, it is characterised in that use the pending sub- regular expressions
Formula is matched with text to be matched, including:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression
And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, after the publicly-owned prefix is removed from the pending sub- regular expression and be publicly-owned
Sew, obtain remainder, obtained remainder is split into at least one sub- regular expression, and will tear open
The every sub- regular expression got is matched with text to be matched successively, until one of them splits
To the success of sub- regular expression and text matches to be matched untill, or until split every obtained sub- canonical
Untill expression formula fails with text matches to be matched.
4. the method as described in claim 1, it is characterised in that use the pending sub- regular expressions
Formula is matched with text to be matched, including:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described
Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
5. the method as described in claim 1, it is characterised in that use the pending sub- regular expressions
Formula is matched with text to be matched, including:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression
Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into at least one sub- regular expressions
Formula;
The every sub- regular expression split into is matched with text to be matched successively, until one of them
Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each
Untill sub- regular expression fails with text matches to be matched.
6. a kind of coalignment of regular expression, it is characterised in that including:
Cutting unit, for being partitioned into a sub- regular expressions from the regular expression for text matches
Formula, is used as pending sub- regular expression;
Processing unit, for performing following processing for the pending sub- regular expression:Treated using described
Sub- regular expression is handled to be matched with text to be matched;If matching, matching result is obtained, is matched
Terminate;If not matching, continue to treat described in removing from the regular expression by the cutting unit
Handle and a sub- regular expression is partitioned into the remainder after sub- regular expression, as new pending
Sub- regular expression;Returned for new pending sub- regular expression and continue executing with the processing, until institute
Untill stating cutting unit and can not being partitioned into new pending sub- regular expression again from the regular expression,
Then matching terminates.
7. device as claimed in claim 6, it is characterised in that the cutting unit is from for text
When a sub- regular expression is partitioned into the regular expression of matching, specifically for:
Since for the original position of the regular expression of text matches, from front to back search first not by
Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag
Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions
First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits
Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
8. device as claimed in claim 6, it is characterised in that the processing unit is treated using described
When handling sub- regular expression and being matched with text to be matched, specifically for:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression
And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, the public affairs are removed from the pending sub- regular expression by the cutting unit
There are prefix and publicly-owned suffix, obtain remainder, obtained remainder is split into at least one sub- canonical
Expression formula, and every sub- regular expression that fractionation is obtained is matched with text to be matched successively, until
Untill one of them splits obtained sub- regular expression and text matches to be matched success, or until split
To every sub- regular expression with text matches to be matched failure untill.
9. device as claimed in claim 6, it is characterised in that the processing unit is treated using described
When handling sub- regular expression and being matched with text to be matched, specifically for:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described
Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
10. device as claimed in claim 6, it is characterised in that the processing unit is treated using described
When handling sub- regular expression and being matched with text to be matched, specifically for:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression
Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into by least one by cutting unit
Sub- regular expression;
The every sub- regular expression split into is matched with text to be matched successively, until one of them
Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each
Untill sub- regular expression fails with text matches to be matched.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610206194.XA CN107291761A (en) | 2016-04-05 | 2016-04-05 | The matching process and device of a kind of regular expression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610206194.XA CN107291761A (en) | 2016-04-05 | 2016-04-05 | The matching process and device of a kind of regular expression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107291761A true CN107291761A (en) | 2017-10-24 |
Family
ID=60092985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610206194.XA Pending CN107291761A (en) | 2016-04-05 | 2016-04-05 | The matching process and device of a kind of regular expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107291761A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376339A (en) * | 2018-08-02 | 2019-02-22 | 浙江大学 | A text conversion candidate rule information extraction method based on user behavior |
CN110246592A (en) * | 2019-06-25 | 2019-09-17 | 山东健康医疗大数据有限公司 | Realize the mapping method and system of medical institutions' isomeric data codomain code standardization |
CN110928793A (en) * | 2019-11-28 | 2020-03-27 | Oppo广东移动通信有限公司 | Regular expression detection method and device and computer readable storage medium |
CN111027794A (en) * | 2019-03-29 | 2020-04-17 | 广东小天才科技有限公司 | Dictation operation correcting method and learning equipment |
CN114444445A (en) * | 2022-02-07 | 2022-05-06 | 北京百度网讯科技有限公司 | Text processing method, device, electronic device and storage medium |
CN114896360A (en) * | 2022-05-18 | 2022-08-12 | 北京达佳互联信息技术有限公司 | Text processing method and device, electronic equipment and computer readable storage medium |
CN117312485A (en) * | 2023-09-27 | 2023-12-29 | 东北大学 | Regular expression matching method of log data oriented to database management system |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050223109A1 (en) * | 2003-08-27 | 2005-10-06 | Ascential Software Corporation | Data integration through a services oriented architecture |
US20060271526A1 (en) * | 2003-02-04 | 2006-11-30 | Cataphora, Inc. | Method and apparatus for sociological data analysis |
CN101267357A (en) * | 2007-03-13 | 2008-09-17 | 北京启明星辰信息技术有限公司 | A SQL injection attack detection method and system |
CN101286988A (en) * | 2008-04-18 | 2008-10-15 | 北京启明星辰信息技术股份有限公司 | Parallel multi-mode matching method and system therefor |
CN101377816A (en) * | 2008-08-15 | 2009-03-04 | 北京启明星辰信息技术股份有限公司 | Method and system for matching paralleling multiple-mode of matching regulation including displacement indication symbol |
CN101841546A (en) * | 2010-05-17 | 2010-09-22 | 华为技术有限公司 | Rule matching method, device and system |
CN102063493A (en) * | 2010-12-30 | 2011-05-18 | 北京大学 | Content extraction method based on regular expression group and control logic |
US8135711B2 (en) * | 2002-02-04 | 2012-03-13 | Cataphora, Inc. | Method and apparatus for sociological data analysis |
CN102789504A (en) * | 2012-07-19 | 2012-11-21 | 姜赢 | Chinese grammar correcting method and system on basis of XLM (Extensible Markup Language) rule |
CN103198065A (en) * | 2012-01-06 | 2013-07-10 | 北京奇策科技有限公司 | Optimization method for regular expression matching circuit |
CN103729452A (en) * | 2013-12-31 | 2014-04-16 | 杭州华为数字技术有限公司 | Rule matching method and device |
CN104714951A (en) * | 2013-12-13 | 2015-06-17 | 世纪禾光科技发展(北京)有限公司 | Parallel multi-pattern matching method and system |
-
2016
- 2016-04-05 CN CN201610206194.XA patent/CN107291761A/en active Pending
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8135711B2 (en) * | 2002-02-04 | 2012-03-13 | Cataphora, Inc. | Method and apparatus for sociological data analysis |
US20060271526A1 (en) * | 2003-02-04 | 2006-11-30 | Cataphora, Inc. | Method and apparatus for sociological data analysis |
US20050223109A1 (en) * | 2003-08-27 | 2005-10-06 | Ascential Software Corporation | Data integration through a services oriented architecture |
CN101267357A (en) * | 2007-03-13 | 2008-09-17 | 北京启明星辰信息技术有限公司 | A SQL injection attack detection method and system |
CN101286988A (en) * | 2008-04-18 | 2008-10-15 | 北京启明星辰信息技术股份有限公司 | Parallel multi-mode matching method and system therefor |
CN101377816A (en) * | 2008-08-15 | 2009-03-04 | 北京启明星辰信息技术股份有限公司 | Method and system for matching paralleling multiple-mode of matching regulation including displacement indication symbol |
CN101841546A (en) * | 2010-05-17 | 2010-09-22 | 华为技术有限公司 | Rule matching method, device and system |
CN102063493A (en) * | 2010-12-30 | 2011-05-18 | 北京大学 | Content extraction method based on regular expression group and control logic |
CN103198065A (en) * | 2012-01-06 | 2013-07-10 | 北京奇策科技有限公司 | Optimization method for regular expression matching circuit |
CN102789504A (en) * | 2012-07-19 | 2012-11-21 | 姜赢 | Chinese grammar correcting method and system on basis of XLM (Extensible Markup Language) rule |
CN104714951A (en) * | 2013-12-13 | 2015-06-17 | 世纪禾光科技发展(北京)有限公司 | Parallel multi-pattern matching method and system |
CN103729452A (en) * | 2013-12-31 | 2014-04-16 | 杭州华为数字技术有限公司 | Rule matching method and device |
CN103729452B (en) * | 2013-12-31 | 2017-05-10 | 杭州华为数字技术有限公司 | Rule matching method and device |
Non-Patent Citations (1)
Title |
---|
邵妍: "正则表达式匹配算法并行化技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376339A (en) * | 2018-08-02 | 2019-02-22 | 浙江大学 | A text conversion candidate rule information extraction method based on user behavior |
CN109376339B (en) * | 2018-08-02 | 2020-07-03 | 浙江大学 | Text conversion candidate rule information extraction method based on user behaviors |
CN111027794A (en) * | 2019-03-29 | 2020-04-17 | 广东小天才科技有限公司 | Dictation operation correcting method and learning equipment |
CN111027794B (en) * | 2019-03-29 | 2023-09-26 | 广东小天才科技有限公司 | A correction method and learning device for dictation assignments |
CN110246592A (en) * | 2019-06-25 | 2019-09-17 | 山东健康医疗大数据有限公司 | Realize the mapping method and system of medical institutions' isomeric data codomain code standardization |
CN110246592B (en) * | 2019-06-25 | 2023-07-14 | 山东浪潮智慧医疗科技有限公司 | Mapping method and system for realizing standardization of medical institution heterogeneous data value domain codes |
CN110928793A (en) * | 2019-11-28 | 2020-03-27 | Oppo广东移动通信有限公司 | Regular expression detection method and device and computer readable storage medium |
CN114444445A (en) * | 2022-02-07 | 2022-05-06 | 北京百度网讯科技有限公司 | Text processing method, device, electronic device and storage medium |
CN114896360A (en) * | 2022-05-18 | 2022-08-12 | 北京达佳互联信息技术有限公司 | Text processing method and device, electronic equipment and computer readable storage medium |
CN117312485A (en) * | 2023-09-27 | 2023-12-29 | 东北大学 | Regular expression matching method of log data oriented to database management system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107291761A (en) | The matching process and device of a kind of regular expression | |
US10346456B2 (en) | Conditional string search | |
Kumar et al. | Advanced algorithms for fast and scalable deep packet inspection | |
JPH03122766A (en) | Prefix search tree with partial key branch function | |
CN103049709B (en) | Based on password recovery system and the restoration methods thereof of generator expansion rainbow table | |
CN110120942A (en) | Security strategy rule matching method and device, firewall box and medium | |
KR20110138237A (en) | Method and apparatus for matching input symbol streams to symbol patterns | |
CN105824825B (en) | A sensitive data identification method and device | |
CN105468588A (en) | Character string matching method and apparatus | |
CN107748778B (en) | A method and device for extracting addresses | |
CN103617226A (en) | Regular expression matching method and device | |
CN106469186A (en) | A kind of method and device of character string comparison | |
CN110830603B (en) | IPV6 address summarizing method and device | |
US20080133443A1 (en) | Methods and Apparatus for User-Guided Inference of Regular Expressions for Information Extraction | |
US9529835B2 (en) | Online compression for limited sequence length radix tree | |
US9218336B2 (en) | Efficient implementation of morphology for agglutinative languages | |
CN107807918A (en) | The method and device of Thai words recognition | |
CN104991963B (en) | Document handling method and device | |
CN109543024B (en) | Text processing method and device | |
Scott et al. | Tomita-style generalised LR parsers | |
CN113792247A (en) | Method, apparatus, device and medium for generating functional flow chart based on code characteristics | |
KR101626721B1 (en) | An efficient algorithm for boxed mesh permutation pattern matching | |
CN115329066B (en) | Text matching method, device, computing device and computer storage medium | |
CA2614974C (en) | Conditional string search | |
ZhanPeng et al. | Dictionary matching: review of the aho-corasick algorithm and vision for large dictionaries |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20171024 |
|
WD01 | Invention patent application deemed withdrawn after publication |