[go: up one dir, main page]

CN107291761A - The matching process and device of a kind of regular expression - Google Patents

The matching process and device of a kind of regular expression Download PDF

Info

Publication number
CN107291761A
CN107291761A CN201610206194.XA CN201610206194A CN107291761A CN 107291761 A CN107291761 A CN 107291761A CN 201610206194 A CN201610206194 A CN 201610206194A CN 107291761 A CN107291761 A CN 107291761A
Authority
CN
China
Prior art keywords
regular expression
sub
matched
text
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610206194.XA
Other languages
Chinese (zh)
Inventor
赵也
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING UNION VOOLE TECHNOLOGY Co Ltd
Original Assignee
BEIJING UNION VOOLE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING UNION VOOLE TECHNOLOGY Co Ltd filed Critical BEIJING UNION VOOLE TECHNOLOGY Co Ltd
Priority to CN201610206194.XA priority Critical patent/CN107291761A/en
Publication of CN107291761A publication Critical patent/CN107291761A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a kind of matching process of regular expression and device, methods described is:A sub- regular expression is partitioned into from the regular expression for text matches, pending sub- regular expression is used as;Following processing is performed for pending sub- regular expression:Matched using pending sub- regular expression with text to be matched;If matching, matching result is obtained, matching terminates;If not matching, then continue to remove from regular expression in the remainder after the pending sub- regular expression and be partitioned into a sub- regular expression, it is used as new pending sub- regular expression, returned for new pending sub- regular expression and continue executing with the processing, untill it can not be partitioned into new pending sub- regular expression again from regular expression, then matching terminates.This method by splitting regular expression, with for former regular expression better simply sub- regular expression order go matched text, matching speed can be improved.

Description

The matching process and device of a kind of regular expression
Technical field
The present invention relates to the matching process and device of data processing field, more particularly to a kind of regular expression.
Background technology
Regular expression can use the complicated data characteristics of simple syntactic description, therefore be widely used in net The multiple fields such as network intrusion detection, document content retrieval.
The pattern match of regular expression mainly can be by deterministic finite automaton (referred to as:DFA) and Non-deterministic finite automaton is (referred to as:NFA) two kinds of data structures are realized.
But the matching of existing regular expression all has respective defect, for example, DFA matching speeds It hurry up, but EMS memory occupation is too high, for many complicated regular expression rules or large-scale regular expressions Formula regular collection, DFA can generating state blast.In another example, NFA EMS memory occupations are small, but matching speed It is extremely slow, actual network processes requirement can not be met at all on multinuclear or general purpose processor platform.
The content of the invention
The application provides a kind of matching process and device of regular expression, for solving existing regular expressions Either the matching process of formula can produce substantial amounts of intermediateness, or the problem of matching speed is slow.
The application first aspect there is provided a kind of matching process of regular expression, including:
A sub- regular expression is partitioned into from the regular expression for text matches, pending son is used as Regular expression;
Following processing is performed for the pending sub- regular expression:
Matched using the pending sub- regular expression with text to be matched;
If matching, matching result is obtained, matching terminates;
If not matching, continue to remove from the regular expression after the pending sub- regular expression Remainder in be partitioned into a sub- regular expression, as new pending sub- regular expression, for New pending sub- regular expression, which is returned, continues executing with the processing, until from the regular expression not Untill new pending sub- regular expression can be partitioned into again, then matching terminates.
In a possible design, a sub- canonical is partitioned into from the regular expression for text matches Expression formula, including:
Since for the original position of the regular expression of text matches, from front to back search first not by Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
In a possible design, the pending sub- regular expression and text to be matched progress are used Match somebody with somebody, including:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, after the publicly-owned prefix is removed from the pending sub- regular expression and be publicly-owned Sew, obtain remainder, obtained remainder is split into at least one sub- regular expression, and will tear open The every sub- regular expression got is matched with text to be matched successively, until one of them splits To the success of sub- regular expression and text matches to be matched untill, or until split every obtained sub- canonical Untill expression formula fails with text matches to be matched.
As can be seen here, the matching process that the application is provided is the matching process that quickly fails, in publicly-owned prefix and / or the failure of publicly-owned suffix match in the case of, it is not necessary to the remainder of pending sub- regular expression is allocated as Match somebody with somebody, can so accelerate matching speed, improve matching efficiency.
In a possible design, carried out using the pending sub- regular expression with text to be matched Matching, including:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
In a possible design, carried out using the pending sub- regular expression with text to be matched Matching, including:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into at least one sub- regular expressions Formula;
The every sub- regular expression split into is matched with text to be matched successively, until one of them Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each Untill sub- regular expression fails with text matches to be matched.
The application second aspect there is provided a kind of coalignment of regular expression, including:
Cutting unit, for being partitioned into a sub- regular expressions from the regular expression for text matches Formula, is used as pending sub- regular expression;
Processing unit, for performing following processing for the pending sub- regular expression:Treated using described Sub- regular expression is handled to be matched with text to be matched;If matching, matching result is obtained, is matched Terminate;If not matching, continue to treat described in removing from the regular expression by the cutting unit Handle and a sub- regular expression is partitioned into the remainder after sub- regular expression, as new pending Sub- regular expression;Returned for new pending sub- regular expression and continue executing with the processing, until institute Untill stating cutting unit and can not being partitioned into new pending sub- regular expression again from the regular expression, Then matching terminates.
In a possible design, the cutting unit is dividing from the regular expression for text matches When cutting out a sub- regular expression, specifically for:
Since for the original position of the regular expression of text matches, from front to back search first not by Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
In a possible design, the processing unit using the pending sub- regular expression with treating When matched text is matched, specifically for:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, the public affairs are removed from the pending sub- regular expression by the cutting unit There are prefix and publicly-owned suffix, obtain remainder, obtained remainder is split into at least one sub- canonical Expression formula, and every sub- regular expression that fractionation is obtained is matched with text to be matched successively, until Untill one of them splits obtained sub- regular expression and text matches to be matched success, or until split To every sub- regular expression with text matches to be matched failure untill.
In a possible design, the processing unit using the pending sub- regular expression with treating When the text of matching is matched, specifically for:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
In a possible design, the processing unit using the pending sub- regular expression with treating When the text of matching is matched, specifically for:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into by least one by cutting unit Sub- regular expression;
The every sub- regular expression split into is matched with text to be matched successively, until one of them Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each Untill sub- regular expression fails with text matches to be matched.
The scheme provided using the application, the application provide regular expression matching process be one kind with just It is then the method being oriented to, therefore different from using text as the DFA being oriented to, this method will not produce intermediateness and lead State explosion is caused, also, this method is by splitting regular expression, with for former regular expression Better simply sub- regular expression order goes matched text, so as to improve matching speed.
Brief description of the drawings
A kind of flow chart of the matching process for regular expression that Fig. 1 provides for the application;
A kind of structural representation of the coalignment for regular expression that Fig. 2 provides for the application.
Embodiment
Hereinafter, the part term in the application is explained.
" regular expression ", in many text editors or other instruments, is usually used to retrieval and/or replaces Change those content of text for meeting some pattern.Regular expression is made up of general character and metacharacter.Commonly Character includes the letter of numeral and capital and small letter, and metacharacter is the character with special implication, including as follows 11 alphabetic characters:[]\^$.|*+().Metacharacter is used for specific use, for example, " " is used to match Except line feed metacharacter " n " and " any character in addition to r ";The subexpression that " * " is used to match above is arbitrarily secondary; “" represent matching 0 or 1 that character just before it, when the character immediately any one its His delimiter (* ,+,, { n }, { n, }, { n, m }) below when, match pattern is non-greedy, non-greediness The character string that pattern few matching as far as possible is searched for, and the greedy pattern given tacit consent to then matching institute as much as possible The character string of search;" | " represents two matching conditions carrying out logical "or" (English:Or) computing;" () ", By (and) between expression formula be defined as " group " (English:group).
" the sub- regular expression " of some regular expression refers to the part being partitioned into from the regular expression, this Metacharacter "or" in application not included by metacharacter " group ", i.e. " | " " () " outside are partitioning standards. For example, it is assumed that a regular expression is (A (B | C)) | D, then can divide and obtain (A (B | C)) and two sons of D are just Then expression formula.It should be noted that when the outermost for dividing obtained sub- regular expression has " () ", Continuing to remove outmost " () " into the sub- regular expression before processing, it is so-called outmost to include Number, will bracket that entirely sub- regular expression is included, for example, above-mentioned to continue with (A (B | C)), then first to delete outmost " () ", obtain A (B | C);If but obtained sub- regular expressions The bracket of formula is not in the outermost of the sub- regular expression, such as (DE (F | G) H) | MLN, then without Remove the bracket in the sub- regular expression.The operation of the sub- outmost bracket of regular expression will hereafter be removed It is considered default action, no longer specially refers to.
" publicly-owned prefix " and " publicly-owned suffix ", if a certain (son) regular expression does not include not by metacharacter " group " Comprising metacharacter "or" when, should the part on " group " outer left side of (son) regular expression be referred to as publicly-owned prefix, Should (son) regular expression " group " outside on the right of part be referred to as publicly-owned suffix, for example, it is assumed that a canonical table It is A (B | C) D up to formula, then the publicly-owned prefix of the regular expression is A, publicly-owned suffix is D.
" non-determined type expression formula ", refer to (son) regular expressions that can not be matched according to general character string Include the metacharacter of particular meaning in formula, general non-determined type expression formula, such as d, w .*, [d-f] { 2,3 }, [1-3] * etc..
Corresponding, " deterministic type expression formula " is (son) for referring to be matched according to general character string Metacharacter not comprising particular meaning in regular expression, general deterministic type expression formula, such as abc, 1234, A12 etc..
Substantial amounts of intermediateness can be produced for DFA, state explosion is caused, and NFA matching speeds are slow The problem of, present applicant proposes a kind of matching process of regular expression and device, this method and NFA classes Seemingly, it is a kind of using canonical as the method being oriented to, therefore different from using text as the DFA being oriented to, this method will not Producing intermediateness causes state explosion, also, this method is by splitting regular expression, with compared to original Better simply sub- regular expression order goes matched text for regular expression, can improve matching speed.
What the matching process for the regular expression that the application is provided went for that traditional NFA is applicable owns Program and language, such as Java, GNU Emacs, ergp, less, more .NET etc..
The preferred embodiment of the present invention is described in detail below in conjunction with the accompanying drawings.
A kind of flow chart of the matching process of the regular expression provided as shown in Figure 1 for the application, the side Method includes:
Step 101:A sub- regular expression is partitioned into from the regular expression for text matches, is made For pending sub- regular expression.
Optionally, the process of segmentation regular expression is as follows:
First, since for the original position of the regular expression of text matches, first is searched from front to back The position of individual do not included by constituent element character or metacharacter;Or, can also be from the knot of the regular expression Beam position starts, and first position do not included by constituent element character or metacharacter is searched from back to front.Below For concise explanation, by do not included by the constituent element character or metacharacter, i.e., " | " symbol outside " () " It is known as " group outer or " metacharacter.
Secondly, can be by the original position of the regular expression to first " group outer or " if finding Part between the position of metacharacter is used as a sub- regular expression;Or, can also be by the canonical table Up to the position to the portion between the end position of the regular expression of first " group outer or " metacharacter of formula It is allocated as a sub- regular expression;Or, can also be searched in the regular expression first The wherein shorter part of symbol lengths is selected as one in two parts that " group outer or " metacharacter is divided into Sub- regular expression.
If not finding, the regular expression is regard as a sub- regular expression.
As an example it is assumed that a regular expression is ABC | (DE (F | G (O | P) H)) | MLN, if Search first " group outer or " metacharacter, then first from front to back from the starting position of the regular expression " group outer or " metacharacter is located at behind " ABC ", can select ABC or (DE (F | G (O | P) H)) | the sub- regular expression that MLN is partitioned into as first;If from the end position of the regular expression from Search first " group outer or " metacharacter forward afterwards, then first " group outer or " metacharacter is located at before " MLN " Face, can select ABC | (DE (F | G (O | P) H)) or MLN as first son being partitioned into just Then expression formula.
In another example, it is assumed that a regular expression is DE (F | G), because the regular expression is not present, " group is outer Or " metacharacter, then a sub- regular expression can be used as using DE (F | G) is overall.
Step 102:Following processing is performed for the pending sub- regular expression:Using described pending Sub- regular expression is matched with text to be matched;If matching, step 103 is performed, if matching is not Arrive, then perform step 104.
Step 103:Matching result is obtained, matching terminates.
Step 104:Continue to remove from the regular expression surplus after the pending sub- regular expression A sub- regular expression is partitioned into remaining part point, as new pending sub- regular expression, for new Pending sub- regular expression return to step 102 continues executing with the processing, until from the regular expression In can not be partitioned into new pending sub- regular expression again untill, then matching terminate.
Optionally, when performing step 102, the form of pending sub- regular expression is different, corresponding matching Mode can also be different, can specifically there is following three kinds:
The first matching way, it is adaptable to include or metacharacter and/or constituent element character and comprising publicly-owned prefix and / or publicly-owned suffix pending sub- regular expression, such as DE (F | G (O | P) H.
Second of matching way, it is adaptable to include or metacharacter and/or constituent element character but not comprising publicly-owned prefix And/or the pending sub- regular expression of publicly-owned suffix, such as F | G (O | P).
The third matching way, it is adaptable to do not include or metacharacter or the pending sub- regular expressions of constituent element character Formula, such as ABC.
Group regular expression is contained or when metacharacter and/or constituent element character, represents that the sub- regular expression can Continue to be divided into the sub- regular expression of smaller particle size.Group regular expression does not include or metacharacter or constituent element During character, it has been indivisible minimum unit to represent the sub- regular expression.
These three matching ways are introduced separately below.
The handling process of the first matching way is as follows;
Step one, included or metacharacter and/or constituent element character and comprising publicly-owned in pending sub- regular expression When prefix and/or publicly-owned suffix, the publicly-owned prefix and/or publicly-owned suffix and text to be matched progress are used Match somebody with somebody.
Optionally, if the publicly-owned prefix that pending sub- regular expression is included is non-determined type expression formula, The publicly-owned prefix can be converted into " .*" or " .* ", that is, any value can be matched by giving tacit consent to the publicly-owned prefix, temporarily When the publicly-owned prefix is not matched.For example, it is assumed that a regular expression for d (d) abc, before its is publicly-owned Sew d be non-determined type expression formula, therefore the publicly-owned prefix can be converted to " .*", the canonical table after conversion It is .* up to formula(d) abc, and continue with conversion after regular expression.
Similarly, can if the publicly-owned suffix that pending sub- regular expression is included is non-determined type expression formula So that the publicly-owned suffix is converted into " .*" or " .* ", that is, any value can be matched by giving tacit consent to the publicly-owned suffix, temporarily The publicly-owned suffix is not matched.
“.*" and " .* " the two metacharacter strings may be incorporated for match any value.Difference is that " .* " is Greedy pattern, " .*" be inadequate pattern, for example, text to be matched be 124deffg, if use " .*f " this Regular expression goes matched text, will match to 124deff, if using " .*" this regular expression goes With text, 124def will match to.
“.*" advantage be that backtracking will not be formed, the advantage of " .* ", which is to match, all meets canonical The symbol of expression formula.
Step 2, if matching is less than confirming the pending sub- regular expression and text matches to be matched Failure.
As can be seen here, the matching process that the application is provided is the matching process that quickly fails, in publicly-owned prefix and / or the failure of publicly-owned suffix match in the case of, it is not necessary to the remainder of pending sub- regular expression is allocated as Match somebody with somebody, can so accelerate matching speed, improve matching efficiency.
Step 3, if matching, removed from the pending sub- regular expression the publicly-owned prefix and Publicly-owned suffix, obtains remainder, and obtained remainder is split into at least one sub- regular expression, And matched every sub- regular expression that fractionation is obtained with text to be matched successively, until one of them Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each Untill sub- regular expression fails with text matches to be matched.
When every sub- regular expression for obtaining fractionation is matched with text to be matched successively, Ke Yigen The form of the every sub- regular expression obtained according to splitting, selects suitable matching way to be carried out to it respectively Match somebody with somebody.
Optionally, the matching result that pending sub- regular expression can be obtained in step 3, as tearing open The text to be matched for the sub- regular expression got.
For example, it is assumed that regular expression is DE (F | G (O | P)) H, its publicly-owned prefix DE is used The coordinate range obtained with publicly-owned suffix H matched texts is [50,100], then subsequent match F | G (O | P) When only need to the text [50,100] coordinate in match, it is not necessary to whole text is matched again.
Unmatched range of text can be so filtered out, has cut down a large amount of times produced in whole matching process Trace back, improve matching efficiency.
The handling process of second of matching way is as follows;
Step one, include or metacharacter and/or constituent element character and do not wrap in the pending sub- regular expression Containing publicly-owned prefix and/or during publicly-owned suffix, the pending sub- regular expression is being split into at least one son just Then expression formula.
Step 2, the every sub- regular expression split into is matched with text to be matched successively, until Untill one of them splits obtained sub- regular expression and text matches to be matched success, or until split To every sub- regular expression with text matches to be matched failure untill.
, can basis when every split into sub- regular expression is matched with text to be matched successively The form of obtained every sub- regular expression is split, selects suitable matching way to be carried out to it respectively Match somebody with somebody.
Optionally, can be by the matching result of the sub- regular expression before fractionation, as from the sub- regular expressions The text to be matched for the sub- regular expression that formula is split out.
The handling process of the third matching way is as follows;
Step one, do not include or when metacharacter or constituent element character in the pending sub- regular expression, use The pending sub- regular expression is matched with text to be matched.
Step 2, if matching is less than confirming the pending sub- regular expression and text matches to be matched Failure.
Step 3, if matching, confirm the pending sub- regular expression and text matches to be matched into Work(.
In order to illustrate more clearly of the technical scheme of the application, above-mentioned flow is done below by one embodiment Further instruction is, it is necessary to which explanation, this embodiment is only one embodiment of the application, is not constituted Restriction to the application.
Assuming that text String text to be matched are:
String text=" zhe shi yi ge ce shi de www://cdn.voole.com/a=3&bc=d&cd=kk &de=9923&http://125.39.27.10Ca=appid "
+ " &bofangchuan=ddd "
+"jav orcale php mysql ks tea aliasmethod 3des rsa md5"
+"vlive://voole.comCa=kksjmds ";
Regular expression String str for text matches are:
String str=" (http | vlive)://|(\\d(\\.\\d)*|\\w(\\.\\w)*)\\(ca=(appid (d { 3,5 } | &bof Angchuan=([3-9] { 4,8 } | [d-f] { 2,3 })) | aid)) ";
Processing procedure is as follows:
The first step, for above-mentioned regular expression, first since beginning, " group is outer for interception first Or " metasymbol carry out regular expression segmentation, obtain first pending sub- regular expressions Formula:(http|vlive)://.
Second step, interception (http | vlive):// publicly-owned prefix and publicly-owned suffix.
It can be seen that (http | vlive):// without publicly-owned prefix, publicly-owned suffix is://.
3rd step, is determined according to BM algorithms:// appear in position in whole text.
It can be seen that:// occurred in that altogether in text to be matched three times, it is respectively:
For the first time:de www://;Matching coordinate range is:0-27.
Second:De=9923&http://;Matching coordinate range is:28-71.
For the third time:vlive:The vlive of // last column://;Matching coordinate range is:72-168.
4th step, remove (http | vlive):// in publicly-owned suffix, obtain (http | vlive), then remove outermost The bracket in face, obtains http | vlive.
5th step, because http | vlive includes " | ", therefore it is alienable, by http | and vlive further divides It is cut into http and vlive two parts.
6th step, determines that http appears in the position in the matching coordinate range that the 3rd step is obtained according to BM algorithms Put, because http is to determine type expression formula, therefore without http is converted into " .*" or " .* ".
As can be seen that http is appeared in second matching coordinate range 28-71, http specifically matches seat Marking scope is:67-71.
7th step, it is no subsequently because current sub- regular expression http has been indivisible unit Continue the sub- regular expression matched, and this sub- regular expression http has been met, therefore matching can be returned Coordinate range 67-71, and terminate matching process.
The matching process of regular expression based on the above-mentioned offer of the application, the application provides a kind of regular expressions The coalignment 200 of formula, as shown in Fig. 2 described device 200 includes cutting unit 201 and processing unit 202:
The cutting unit 201, for being partitioned into a son from the regular expression for text matches just Then expression formula, is used as pending sub- regular expression.
The processing unit 202, for performing following processing for the pending sub- regular expression:Make Matched with the pending sub- regular expression with text to be matched;If matching, matching knot is obtained Really, matching terminates;If not matching, continued by the cutting unit 201 from the regular expression A sub- regular expression is partitioned into the middle remainder removed after the pending sub- regular expression, is made For new pending sub- regular expression;Returned for new pending sub- regular expression described in continuing executing with Processing, until the cutting unit 201 can not be partitioned into new pending son again from the regular expression Untill regular expression, then matching terminates.
Optionally, the cutting unit 201 is being partitioned into one from the regular expression for text matches During sub- regular expression, specifically for:Since for the original position of the regular expression of text matches, First position do not included by constituent element character or metacharacter is searched from front to back;If finding, by institute The original position of regular expression is stated to described first position do not included by constituent element character or metacharacter Between part be used as a sub- regular expression;Or, by first of the regular expression not by constituent element Position that character is included or metacharacter is used as one to the part between the end position of the regular expression Individual sub- regular expression;If not finding, the regular expression is regard as a sub- regular expression.
Optionally, the processing unit 202 is using the pending sub- regular expression and text to be matched When being matched, specifically for:Included or metacharacter and/or constituent element word in the pending sub- regular expression Symbol and comprising publicly-owned prefix and/or during publicly-owned suffix, using the publicly-owned prefix and/or publicly-owned suffix with treating Matched text is matched;If matching is less than confirming the current sub- regular expression and text to be matched It fails to match;If matching, by the cutting unit 201 from the pending sub- regular expression The publicly-owned prefix and publicly-owned suffix are removed, remainder is obtained, obtained remainder is split at least One sub- regular expression, and every sub- regular expression that fractionation is obtained is carried out with text to be matched successively Matching, untill the sub- regular expression that one of fractionation is obtained with text matches to be matched success, or Untill every obtained sub- regular expression is split with text matches to be matched failure.
Optionally, the processing unit 202 is using the pending sub- regular expression and text to be matched When this progress is matched, specifically for:Do not include or metacharacter or constituent element in the pending sub- regular expression During character, matched using the pending sub- regular expression with text to be matched;If matching less than, Then confirm that the pending sub- regular expression fails with text matches to be matched;If matching, institute is confirmed State pending sub- regular expression and text matches to be matched success.
Optionally, the processing unit 202 is using the pending sub- regular expression and text to be matched When this progress is matched, specifically for:Included or metacharacter and/or constituent element in the pending sub- regular expression Character and during not comprising publicly-owned prefix and/or publicly-owned suffix, by cutting unit 201 by the pending son Regular expression splits at least one sub- regular expression;By the every sub- regular expression split into successively Matched with text to be matched, until sub- regular expression and text to be matched that one of fractionation is obtained Untill the match is successful, or until every sub- regular expression that fractionation is obtained fails with text matches to be matched Untill.
In summary, the technical scheme that the application is provided is a kind of using canonical as the method being oriented to, therefore is different from Using text as the DFA being oriented to, this method, which will not produce intermediateness, causes state explosion, also, this method By splitting regular expression, with the better simply sub- regular expression order for former regular expression Matched text is gone, matching speed can be improved.In addition, the technical scheme that the application is provided can be by before fractionation Sub- regular expression matching result, be used as the sub- regular expression split out from the sub- regular expression Text to be matched, can so filter out unmatched range of text, cut down in whole matching process and produced A large amount of backtrackings, improve matching efficiency.
It should be understood by those skilled in the art that, embodiments of the invention can be provided as method, system or meter Calculation machine program product.Therefore, the present invention can be using complete hardware embodiment, complete software embodiment or knot The form of embodiment in terms of conjunction software and hardware.Wherein wrapped one or more moreover, the present invention can be used Containing computer usable program code computer-usable storage medium (include but is not limited to magnetic disk storage, CD-ROM, optical memory etc.) on the form of computer program product implemented.
The present invention is with reference to the production of method according to embodiments of the present invention, equipment (system) and computer program The flow chart and/or block diagram of product is described.It should be understood that can by computer program instructions implementation process figure and / or each flow and/or square frame in block diagram and the flow in flow chart and/or block diagram and/ Or the combination of square frame.These computer program instructions can be provided to all-purpose computer, special-purpose computer, insertion Formula processor or the processor of other programmable data processing devices are to produce a machine so that pass through and calculate The instruction of the computing device of machine or other programmable data processing devices is produced for realizing in flow chart one The device for the function of being specified in individual flow or multiple flows and/or one square frame of block diagram or multiple square frames.
, but those skilled in the art once know base although preferred embodiments of the present invention have been described This creative concept, then can make other change and modification to these embodiments.So, appended right will Ask and be intended to be construed to include preferred embodiment and fall into having altered and changing for the scope of the invention.

Claims (10)

1. a kind of matching process of regular expression, it is characterised in that including:
A sub- regular expression is partitioned into from the regular expression for text matches, pending son is used as Regular expression;
Following processing is performed for the pending sub- regular expression:
Matched using the pending sub- regular expression with text to be matched;
If matching, matching result is obtained, matching terminates;
If not matching, continue to remove from the regular expression after the pending sub- regular expression Remainder in be partitioned into a sub- regular expression, as new pending sub- regular expression, for New pending sub- regular expression, which is returned, continues executing with the processing, until from the regular expression not Untill new pending sub- regular expression can be partitioned into again, then matching terminates.
2. the method as described in claim 1, it is characterised in that from the regular expressions for text matches A sub- regular expression is partitioned into formula, including:
Since for the original position of the regular expression of text matches, from front to back search first not by Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
3. the method as described in claim 1, it is characterised in that use the pending sub- regular expressions Formula is matched with text to be matched, including:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, after the publicly-owned prefix is removed from the pending sub- regular expression and be publicly-owned Sew, obtain remainder, obtained remainder is split into at least one sub- regular expression, and will tear open The every sub- regular expression got is matched with text to be matched successively, until one of them splits To the success of sub- regular expression and text matches to be matched untill, or until split every obtained sub- canonical Untill expression formula fails with text matches to be matched.
4. the method as described in claim 1, it is characterised in that use the pending sub- regular expressions Formula is matched with text to be matched, including:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
5. the method as described in claim 1, it is characterised in that use the pending sub- regular expressions Formula is matched with text to be matched, including:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into at least one sub- regular expressions Formula;
The every sub- regular expression split into is matched with text to be matched successively, until one of them Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each Untill sub- regular expression fails with text matches to be matched.
6. a kind of coalignment of regular expression, it is characterised in that including:
Cutting unit, for being partitioned into a sub- regular expressions from the regular expression for text matches Formula, is used as pending sub- regular expression;
Processing unit, for performing following processing for the pending sub- regular expression:Treated using described Sub- regular expression is handled to be matched with text to be matched;If matching, matching result is obtained, is matched Terminate;If not matching, continue to treat described in removing from the regular expression by the cutting unit Handle and a sub- regular expression is partitioned into the remainder after sub- regular expression, as new pending Sub- regular expression;Returned for new pending sub- regular expression and continue executing with the processing, until institute Untill stating cutting unit and can not being partitioned into new pending sub- regular expression again from the regular expression, Then matching terminates.
7. device as claimed in claim 6, it is characterised in that the cutting unit is from for text When a sub- regular expression is partitioned into the regular expression of matching, specifically for:
Since for the original position of the regular expression of text matches, from front to back search first not by Position that constituent element character is included or metacharacter;
If finding, by the original position of the regular expression to described first not by constituent element character bag Part between position contain or metacharacter is used as a sub- regular expression;Or, by the regular expressions First of formula be being included by constituent element character or position of metacharacter to the regular expression stop bits Part between putting is used as a sub- regular expression;
If not finding, the regular expression is regard as a sub- regular expression.
8. device as claimed in claim 6, it is characterised in that the processing unit is treated using described When handling sub- regular expression and being matched with text to be matched, specifically for:
Included or metacharacter and/or constituent element character and comprising publicly-owned prefix in the pending sub- regular expression And/or during publicly-owned suffix, matched using the publicly-owned prefix and/or publicly-owned suffix with text to be matched;
If matching is less than confirming that the current sub- regular expression fails with text matches to be matched;
If matching, the public affairs are removed from the pending sub- regular expression by the cutting unit There are prefix and publicly-owned suffix, obtain remainder, obtained remainder is split into at least one sub- canonical Expression formula, and every sub- regular expression that fractionation is obtained is matched with text to be matched successively, until Untill one of them splits obtained sub- regular expression and text matches to be matched success, or until split To every sub- regular expression with text matches to be matched failure untill.
9. device as claimed in claim 6, it is characterised in that the processing unit is treated using described When handling sub- regular expression and being matched with text to be matched, specifically for:
Do not include in the pending sub- regular expression or when metacharacter or constituent element character, wait to locate using described Sub- regular expression is managed to be matched with text to be matched;
If matching is less than confirming that the pending sub- regular expression and text matches to be matched fail;
If matching, the pending sub- regular expression and text matches to be matched success are confirmed.
10. device as claimed in claim 6, it is characterised in that the processing unit is treated using described When handling sub- regular expression and being matched with text to be matched, specifically for:
Included or metacharacter and/or constituent element character and not comprising before publicly-owned in the pending sub- regular expression Sew and/or during publicly-owned suffix, the pending sub- regular expression is split into by least one by cutting unit Sub- regular expression;
The every sub- regular expression split into is matched with text to be matched successively, until one of them Untill splitting obtained sub- regular expression and text matches to be matched success, or until split obtain each Untill sub- regular expression fails with text matches to be matched.
CN201610206194.XA 2016-04-05 2016-04-05 The matching process and device of a kind of regular expression Pending CN107291761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610206194.XA CN107291761A (en) 2016-04-05 2016-04-05 The matching process and device of a kind of regular expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610206194.XA CN107291761A (en) 2016-04-05 2016-04-05 The matching process and device of a kind of regular expression

Publications (1)

Publication Number Publication Date
CN107291761A true CN107291761A (en) 2017-10-24

Family

ID=60092985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610206194.XA Pending CN107291761A (en) 2016-04-05 2016-04-05 The matching process and device of a kind of regular expression

Country Status (1)

Country Link
CN (1) CN107291761A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376339A (en) * 2018-08-02 2019-02-22 浙江大学 A text conversion candidate rule information extraction method based on user behavior
CN110246592A (en) * 2019-06-25 2019-09-17 山东健康医疗大数据有限公司 Realize the mapping method and system of medical institutions' isomeric data codomain code standardization
CN110928793A (en) * 2019-11-28 2020-03-27 Oppo广东移动通信有限公司 Regular expression detection method and device and computer readable storage medium
CN111027794A (en) * 2019-03-29 2020-04-17 广东小天才科技有限公司 Dictation operation correcting method and learning equipment
CN114444445A (en) * 2022-02-07 2022-05-06 北京百度网讯科技有限公司 Text processing method, device, electronic device and storage medium
CN114896360A (en) * 2022-05-18 2022-08-12 北京达佳互联信息技术有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN117312485A (en) * 2023-09-27 2023-12-29 东北大学 Regular expression matching method of log data oriented to database management system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050223109A1 (en) * 2003-08-27 2005-10-06 Ascential Software Corporation Data integration through a services oriented architecture
US20060271526A1 (en) * 2003-02-04 2006-11-30 Cataphora, Inc. Method and apparatus for sociological data analysis
CN101267357A (en) * 2007-03-13 2008-09-17 北京启明星辰信息技术有限公司 A SQL injection attack detection method and system
CN101286988A (en) * 2008-04-18 2008-10-15 北京启明星辰信息技术股份有限公司 Parallel multi-mode matching method and system therefor
CN101377816A (en) * 2008-08-15 2009-03-04 北京启明星辰信息技术股份有限公司 Method and system for matching paralleling multiple-mode of matching regulation including displacement indication symbol
CN101841546A (en) * 2010-05-17 2010-09-22 华为技术有限公司 Rule matching method, device and system
CN102063493A (en) * 2010-12-30 2011-05-18 北京大学 Content extraction method based on regular expression group and control logic
US8135711B2 (en) * 2002-02-04 2012-03-13 Cataphora, Inc. Method and apparatus for sociological data analysis
CN102789504A (en) * 2012-07-19 2012-11-21 姜赢 Chinese grammar correcting method and system on basis of XLM (Extensible Markup Language) rule
CN103198065A (en) * 2012-01-06 2013-07-10 北京奇策科技有限公司 Optimization method for regular expression matching circuit
CN103729452A (en) * 2013-12-31 2014-04-16 杭州华为数字技术有限公司 Rule matching method and device
CN104714951A (en) * 2013-12-13 2015-06-17 世纪禾光科技发展(北京)有限公司 Parallel multi-pattern matching method and system

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8135711B2 (en) * 2002-02-04 2012-03-13 Cataphora, Inc. Method and apparatus for sociological data analysis
US20060271526A1 (en) * 2003-02-04 2006-11-30 Cataphora, Inc. Method and apparatus for sociological data analysis
US20050223109A1 (en) * 2003-08-27 2005-10-06 Ascential Software Corporation Data integration through a services oriented architecture
CN101267357A (en) * 2007-03-13 2008-09-17 北京启明星辰信息技术有限公司 A SQL injection attack detection method and system
CN101286988A (en) * 2008-04-18 2008-10-15 北京启明星辰信息技术股份有限公司 Parallel multi-mode matching method and system therefor
CN101377816A (en) * 2008-08-15 2009-03-04 北京启明星辰信息技术股份有限公司 Method and system for matching paralleling multiple-mode of matching regulation including displacement indication symbol
CN101841546A (en) * 2010-05-17 2010-09-22 华为技术有限公司 Rule matching method, device and system
CN102063493A (en) * 2010-12-30 2011-05-18 北京大学 Content extraction method based on regular expression group and control logic
CN103198065A (en) * 2012-01-06 2013-07-10 北京奇策科技有限公司 Optimization method for regular expression matching circuit
CN102789504A (en) * 2012-07-19 2012-11-21 姜赢 Chinese grammar correcting method and system on basis of XLM (Extensible Markup Language) rule
CN104714951A (en) * 2013-12-13 2015-06-17 世纪禾光科技发展(北京)有限公司 Parallel multi-pattern matching method and system
CN103729452A (en) * 2013-12-31 2014-04-16 杭州华为数字技术有限公司 Rule matching method and device
CN103729452B (en) * 2013-12-31 2017-05-10 杭州华为数字技术有限公司 Rule matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邵妍: "正则表达式匹配算法并行化技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376339A (en) * 2018-08-02 2019-02-22 浙江大学 A text conversion candidate rule information extraction method based on user behavior
CN109376339B (en) * 2018-08-02 2020-07-03 浙江大学 Text conversion candidate rule information extraction method based on user behaviors
CN111027794A (en) * 2019-03-29 2020-04-17 广东小天才科技有限公司 Dictation operation correcting method and learning equipment
CN111027794B (en) * 2019-03-29 2023-09-26 广东小天才科技有限公司 A correction method and learning device for dictation assignments
CN110246592A (en) * 2019-06-25 2019-09-17 山东健康医疗大数据有限公司 Realize the mapping method and system of medical institutions' isomeric data codomain code standardization
CN110246592B (en) * 2019-06-25 2023-07-14 山东浪潮智慧医疗科技有限公司 Mapping method and system for realizing standardization of medical institution heterogeneous data value domain codes
CN110928793A (en) * 2019-11-28 2020-03-27 Oppo广东移动通信有限公司 Regular expression detection method and device and computer readable storage medium
CN114444445A (en) * 2022-02-07 2022-05-06 北京百度网讯科技有限公司 Text processing method, device, electronic device and storage medium
CN114896360A (en) * 2022-05-18 2022-08-12 北京达佳互联信息技术有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN117312485A (en) * 2023-09-27 2023-12-29 东北大学 Regular expression matching method of log data oriented to database management system

Similar Documents

Publication Publication Date Title
CN107291761A (en) The matching process and device of a kind of regular expression
US10346456B2 (en) Conditional string search
Kumar et al. Advanced algorithms for fast and scalable deep packet inspection
JPH03122766A (en) Prefix search tree with partial key branch function
CN103049709B (en) Based on password recovery system and the restoration methods thereof of generator expansion rainbow table
CN110120942A (en) Security strategy rule matching method and device, firewall box and medium
KR20110138237A (en) Method and apparatus for matching input symbol streams to symbol patterns
CN105824825B (en) A sensitive data identification method and device
CN105468588A (en) Character string matching method and apparatus
CN107748778B (en) A method and device for extracting addresses
CN103617226A (en) Regular expression matching method and device
CN106469186A (en) A kind of method and device of character string comparison
CN110830603B (en) IPV6 address summarizing method and device
US20080133443A1 (en) Methods and Apparatus for User-Guided Inference of Regular Expressions for Information Extraction
US9529835B2 (en) Online compression for limited sequence length radix tree
US9218336B2 (en) Efficient implementation of morphology for agglutinative languages
CN107807918A (en) The method and device of Thai words recognition
CN104991963B (en) Document handling method and device
CN109543024B (en) Text processing method and device
Scott et al. Tomita-style generalised LR parsers
CN113792247A (en) Method, apparatus, device and medium for generating functional flow chart based on code characteristics
KR101626721B1 (en) An efficient algorithm for boxed mesh permutation pattern matching
CN115329066B (en) Text matching method, device, computing device and computer storage medium
CA2614974C (en) Conditional string search
ZhanPeng et al. Dictionary matching: review of the aho-corasick algorithm and vision for large dictionaries

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20171024

WD01 Invention patent application deemed withdrawn after publication