WO2024235443A1 - System and method for data processing for enhanced regex matching - Google Patents
System and method for data processing for enhanced regex matching Download PDFInfo
- Publication number
- WO2024235443A1 WO2024235443A1 PCT/EP2023/062945 EP2023062945W WO2024235443A1 WO 2024235443 A1 WO2024235443 A1 WO 2024235443A1 EP 2023062945 W EP2023062945 W EP 2023062945W WO 2024235443 A1 WO2024235443 A1 WO 2024235443A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- input data
- regex
- sequences
- characters
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Definitions
- the present disclosure relates generally to data processing systems and computing resources management; more specifically, the present disclosure relates to a system and a method for data processing for enhanced regex matching; for example, the present disclosure relates to reducing the computing time or computing resources for matching groups of regular expressions (i.e., commonly referred to as “regex”).
- regular expressions are used to locate and group information from documents, such as by creating a state machine and passing a text of the documents to the state machine until one or more matches are found.
- PII personally identifiable information
- an identification of personally identifiable information (PII) that includes a fixed template, such as a series of digits for a mobile number, an Israeli ID, for example, with nine digits only and the like from the documents.
- PII personally identifiable information
- the scanning of the documents to search the regex for grouping the regex outperforms searching for each regex separately because an aggregated state machine includes fewer states and transitions than the sum of states and transitions of each regex state machine.
- the present disclosure provides a system and a method of using the system including computing hardware that is configured to process input data to generate corresponding output data.
- the present disclosure provides a solution to the existing problem of how to increase the performance of regex matching search or perform grouping of the regular expression without adding any additional load on computing time and resources.
- An aim of the present disclosure is to provide a solution that overcomes, at least partially, the technical problem encountered in the prior art and provides an improved system and method of data processing for enhanced regex matching that reduces the computing time or computing resources to match groups of regular expressions.
- the disclosed method and the system significantly improve the regular expression performance and grouping, thereby improving the functioning of the computing device itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional systems and methods.
- PII personally identifiable information
- the present disclosure provides a system including computing hardware that is configured to generate corresponding output data. Moreover, the system is configured to store the input data in the data memory of the system, to scan a regex state machine of the system to identify one or more sequences of shared characters associated with states of the state machine, and to determine lengths X of the one or more sequences.
- local is meant within one hundred (100) characters from where a match occurs, more optionally within twenty (20) characters from where the match occurs, and yet more optionally within ten (10) characters from where the match occurs.
- “local” may be computed by combining a length of a common string that is to be searched (LI), a longest prefix in the regex group (L2), and a length of a common string that is to be searched (L3), and a longest suffix (L4).
- LI a length of a common string that is to be searched
- L2 a longest prefix in the regex group
- L3 a length of a common string that is to be searched
- L4 longest suffix
- the system provides an efficient and an optimized data processing for enhanced regex matching that reduces the computing time or the computing resources to match groups of regular expressions.
- the system is configured to automatically detect character sequences in a state machine describing a regex and further allows skipping of the majority of the text instead of scanning the whole text character by character. Thereafter, the system is configured to apply the regex match on a smaller subset of the text.
- the system is configured to provide encoding characteristics to perform “shorthand” searches for character sequences and to identify the same character type sequences to reduce the size of the searched text.
- the regular expressions are used to search and identify special patterns in a document, such as email addresses, phone numbers, uniform resource locators (URLs), and the like.
- storing the input data in data memory to scan the regex for the identification of one or more sequences of the shared characters is used to determine one or more matches quickly and accurately with reduced processing time.
- the system may be used to detect multiple regular expressions, such as personally identifiable information (PII) detection, malware detection and the like that includes strictly specified characteristics in an efficient and optimized manner with reduced processing time and reduced utilization of resources.
- PII personally identifiable information
- the disclosed system significantly improves the regular expression performance and grouping, thereby improving the functioning of the computing hardware itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional systems.
- the regex search is limited to one or more regions of the input data adjacent to where the match of expressions of states are identified.
- the system is configured to eliminate the requirement of performing the regex search in the whole document and allows the regex search in the input data among the characters that are local to the at least one match occurrence to reduce the resource utilization with an improved processing time.
- the one or more sequences of shared characters are implemented as number sequences.
- implementation of the one or more sequences of the shared characters is implemented as the number of sequences that improve the overall algorithmic cost of scanning for the one or more sequences is reduced, which improves the overall processing time of the system.
- system is configured to build the groups to be able to identify more than one regex present in the input data.
- the system is configured to process and identify the matching patterns within a large amount of the input data efficiently and accurately in order to perform the required actions, such as retrieving the PII from a document and the like.
- the present disclosure provides a method for (namely, a method of) using a system including computing hardware that is configured to process input data to generate corresponding output data.
- the method includes storing the input data in the data memory of the system, scanning a regex state machine of the system to identify one or more sequences of shared characters associated with states of the state machine, and determining lengths X of the one or more sequences, scanning characters included in the input data in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data.
- local is meant within one hundred (100) characters from where a match occurs, more optionally within twenty (20) characters from where the match occurs, and yet more optionally within ten (10) characters from where the match occurs.
- “local” may be computed by combining a length of a common string that is to be searched (LI), a longest prefix in the regex group (L2), and a length of a common string that is to be searched (L3), and a longest suffix (L4).
- LI a length of a common string that is to be searched
- L2 a longest prefix in the regex group
- L3 a length of a common string that is to be searched
- L4 longest suffix
- the method achieves all the advantages and technical effects of the system of the present disclosure.
- FIG. 1 is an illustration of a system including computing hardware that is configured to process input data to generate corresponding output data, in accordance with an embodiment of the present disclosure
- FIG. 2 is an exemplary illustration of scanning a text of a document, in accordance with an embodiment of the present disclosure
- FIG. 3 is another exemplary illustration of a scanning of a text of a document, in accordance with an embodiment of the present disclosure.
- FIG. 4 is a flowchart of a method for using a system including a computing hardware that is configured to process input data to generate corresponding output data.
- an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
- a non-underlined number relates to an item identified by a line linking the nonunderlined number to the item.
- the non-underlined number is used to identify a general item at which the arrow is pointing.
- FIG. 1 is an illustration of a system including computing hardware that is configured to process input data to generate corresponding output data, in accordance with an embodiment of the present disclosure.
- a system 102 includes computing hardware 104 that is coupled to a data memory 106, wherein the system 102 is configured in use to receive input data 108.
- the system 102 including the computing hardware 104 is configured to process the input data 108 to generate corresponding output data.
- the computing hardware 104 of the system 102 is configured to process the input data 108, to generate the corresponding output data with improved efficiency, accuracy, and reduced processing time, such as by providing an improved regular expression performance and grouping.
- the input data 108 includes at least one of text data, image data, and video data.
- the input data 108 includes the text data.
- the input data 108 includes the image data.
- the input data 108 is used to search and group one or more sequences from the input data 108.
- personally identifiable information PII
- PII personally identifiable information
- the input data 108 enables the computing hardware 104 of the system 102 to process the input data 108, such as the text data or the image data or the video data to generate the corresponding output data.
- the system 102 is configured to store the input data 108 in the data memory 106 of the system 102.
- the system 102 is configured to store the text data in the data memory 106 of the system 102.
- the system 102 is configured to store the image data in the data memory 106 of the system 102.
- the input data 108 stored in the data memory 106 is processed to generate the output data.
- the system 102 is configured to scan a regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine and to determine lengths X of the one or more sequences.
- the system 102 is configured to analyze the state machine and scan for the one or more sequences of shared characters that are associated with states in the machine.
- the system 102 is configured to determine the length of the one or more sequences in order to scan the input data efficiently and accurately.
- the identification of the shared sequences is used to optimize the scanning of the text data to reduce the utilization of the resources to search for matching patterns in the input data.
- the states are related by way of probabilities of occurrence.
- the regex state machine is used to parse the text characters of the one or more sequences of the shared characters to identify the matching occurrences of text in the input data 108 by scanning the input data 108.
- the one or more sequences of shared characters are implemented as number sequences.
- the implementation of the number sequences as the one or more sequences of the shared characters is advantageous for grouping the number of states in the regex state machine that further provides an efficient data processing of the regex with the shared sequences to provide an optimized pattern matching that improves the overall efficiency and processing time of the system 102.
- the system 102 is configured to group the one or more sequences into groups. Moreover, sequences of a given group share a sequence of shared characters. In addition, the sequence of shared characters is used for implementing the scanning of characters present in the input data 108.
- the groups of the one or more sequences are used to scan the input data 108 more efficiently and accurately with reduced processing time as the grouping of the one or more sequences eliminates the requirement of scanning the input data 108 again and again for the one or more sequences.
- the system 102 is configured to build the groups to be able to identify more than one regex present in the input data 108. For example, if a 7-digit string or 2-digit characters are followed by a 4-digit string, then, in that case, the 4- digit string is a shared regex that allows the identification of more than one regex, such as a 7- digit string or a 4-digit string.
- the system 102 is configured to build the groups of more or less equal size as shown in the given below table:-
- the overlapping of the one or more groups is required to ensure an accurate and efficient identification of the one or more regex present in the document, such as the identification of the regex that resides between the data segments.
- an expression is split between two data segments, such as an expression
- abbreviations “abl2345678bb” is further shared between two data segments.
- the system 102 is configured to perform overlapping of the data segments due to which the scanning of the multiple data segments can be performed.
- the system 102 is configured to process and identify the matching patterns within a large amount of the input data 108 efficiently and accurately in order to perform the required actions, such as retrieving the PII from the document and the like.
- system 102 is further configured to scan characters included in the input data 108 in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data 108.
- the system 102 is configured to scan every character that is included in the document to determine the lengths X of the one or more sequences.
- the identification of one or more sequences before performing the regex search allows the system 102 to eliminate the major portion of the text of the document that does not include, such one or more sequences that include the characters of the determined lengths X.
- the system 102 allows the computing hardware 104 to scan and identify the one or more sequences of the shared characters associated with a state of the state machine with reduced memory utilization and reduced processing time.
- the system 102 is configured to scan the regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine and to determine lengths X of the one or more sequences. Thereafter, if at least one match occurrence is found, then, in that case, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the at least one match occurrence.
- the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the corresponding match occurrence.
- the regex search is limited to one or more regions of the input data 108 adjacent to where the match of expressions of states are identified.
- the system 102 is configured to eliminate the requirement of performing the regex search in the whole document and allows the regex search in the input data 108 among the characters that are local to the at least one match occurrence to reduce the resource utilization with an improved processing time.
- the system 102 is further configured to generate the output data indicative of where the regex search in an event that at least one match occurrence is found and identifies a match of expressions of states of the regex state machine to the input data 108.
- the regex search is performed.
- the system 102 is configured to identify the match of the expressions of the states in the regex state machine to the input data 108. For example, the identification of the personally identifiable information (PII), such as an Israeli ID, a phone number, and the like.
- PII personally identifiable information
- the system 102 is allowed to locate and analyze the specific areas of the input data 108, such as the matching patterns that may be further used for data mining, natural language processing, and the like.
- the system 102 provides an efficient and an optimized data processing for enhanced regex matching that reduces the computing time or computing resources to match groups of regular expressions.
- the system 102 is configured to automatically detect character sequences in a state machine describing a regex and further allows skipping of the majority of the text. Thereafter, the system 102 is configured to apply the regex match on a smaller subset of the text.
- the system 102 is configured to provide encoding characteristics to perform “shorthand” searches for character sequences and to identify the same character type sequences to reduce the size of the searched text.
- the regular expressions are used to search and identify special patterns in a document, such as email addresses, phone numbers, uniform resource locators (URLs), and the like.
- storing the input data 108 in data memory to scan the regex for the identification of one or more sequences of the shared characters is used to determine one or more matches quickly and accurately with reduced processing time.
- the system 102 provided may be used to detect multiple regular expressions, such as personally identifiable information (PII) detection, malware detection and the like that includes strictly specified characteristics in an efficient and optimized manner with reduced processing time and reduced utilization of resources.
- PII personally identifiable information
- the disclosed system significantly improves the regular expression performance and grouping, thereby improving the functioning of the computing hardware 104 itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional systems.
- PII personally identifiable information
- FIG. 2 is an exemplary illustration of scanning a text of a document, in accordance with an embodiment of the present disclosure.
- FIG. 2 is described in conjunction with elements from FIG.1.
- an illustration 200 of scanning the text of the document that includes a series of operations from 202 to 216.
- the computing hardware 104 of the system 102 is configured to execute the series of operations from 202 to 216, such as by scanning the text of the document.
- the computing hardware 104 is configured to scan for a 7-digit sequence, such as from the operation 202 instead of scanning the document character by character, such as by parsing through each of the states of the regex state machine to identify the 7-digit sequence.
- the system 102 is configured to limit the size of the regex state machine.
- the system 102 is configured to perform optimized scanning of the input data 108.
- breaking the regex into groups of more or less equal size allows the system 102 to perform parallel programming (i.e., multi-threading or multi-processing) that improves the overall performance of the system 102.
- the regex state machine is applied only when a character is found, so the document is scanned not for every character, but in steps. Moreover, the state machine is used in conjunction with searching the document, only when such a character is found, and then the characters are applied to the state machine one by one.
- the computing hardware 104 is configured to scan a first digit, such as at the operation 204 followed by the scanning of a second digit, such as at the operation 206, followed by the scanning of a third digit, such as at the operation 208, followed by the scanning of a fourth digit, such as at the operation 210, followed by the scanning of a fifth digit, such as at the operation 212, followed by the scanning of a sixth digit, such as at the operation 214, and finally the scanning of a seventh digit, such as at the operation 216. Thereafter, the computing hardware 104 of the system 102 is configured to apply the regex matching.
- the 7- digit sequence is detected by scanning for digits (e.g., digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”) in the regex state machine to provide an optimized scanning of the text of the document with reduced processing time and resource utilization.
- digits e.g., digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”.
- FIG. 3 is another exemplary illustration of scanning a text of a document, in accordance with an embodiment of the present disclosure.
- FIG. 3 is described in conjunction with elements from FIG.1 and FIG. 2.
- an illustration 300 of a scanning of the text of the document there is shown an illustration 300 of scanning the text of the document.
- an illustration 300 of scanning the text of the document that includes a series of operations from 302 to 320.
- the computing hardware 104 of the system 102 is configured to execute the series of operations from 302 to 320, such as by scanning the text of the document.
- the computing hardware 104 is configured to scan the one or more sequences, such as at the operation 302.
- the scanning of the one or more sequences includes a first sequence that includes the scanning of the two characters, such as from “a”, “b”, “c”, “d”, and “e” followed by 4-digit numbers, such as from digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”.
- the scanning of the one or more sequences includes scanning for a second sequence includes the scanning of the 7-digit number such as “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”.
- the scanning of the first sequence includes the scanning of a first character, such as at the operation 304, followed by the scanning of a second character, such as at the operation 306, followed by the scanning of a first digit, such as at the operation 314, followed by the scanning of a second digit, such as at the operation 316, followed by the scanning of a third digit, such as at the operation 318, and finally scanning of the fourth digit, such as at the operation 320.
- the scanning of the second sequence includes scanning of a first digit, such as at the operation 308, followed by the scanning of a second digit, such as at the operation 310, followed by the scanning of a third digit, such as at the operation 312, followed by the scanning of a fourth digit, such as at the operation 314, followed by the scanning of a fifth digit, such as at the operation 316, followed by the scanning of a sixth digit, such as at the operation 318 and finally scanning of the seventh digit, such as at the operation 320.
- the computing hardware 104 is configured to perform regex matching after the identification of the first sequence or the second sequence.
- the traversal of states such as “2”, “3”, “5”, “6”, and “7”, as shown in FIG. 3 are shared between both the sequences, such as the first sequence and the second sequence, which are used to reduce the size of the search text.
- an uncapitalized alphabetical characters for example are not useful since they consist of the majority of many textual documents and for human-readable text, digit sequences or sequences of capitalized characters are usually useful for reducing the text size. Therefore, in order to find the required one or more sequences, the state machine registers all the shared paths ordered by their length during the creation of the state machine.
- the required sequence is used either by preprocessing of the text, the context of the document (e.g., document in a folder containing human-readable text), or by document metadata (e.g., MS Word document) that contains information about the number of characters and their type.
- document metadata e.g., MS Word document
- the scanning of the text before applying the regex matching allows the system 102 to provide optimized scanning and reduced processing time, such as by encapsulating several consecutive states that are shared (i.e., “2”, “3”, “5”, and “6”) between all paths to the final state (i.e., “7”).
- FIG. 4 is a flowchart of a method for use in a system including computing hardware that is configured to process input data to generate corresponding output data, in accordance with an embodiment of the present disclosure.
- FIG. 4 is described in conjunction with elements from FIGs. 1, 2 and 3.
- FIG. 4 there is shown a flowchart of a method 400 for use in the system 102 that is configured to process input data (i.e., the input data 108 of FIG. 1) to generate corresponding output data.
- the method 400 includes steps 402 to 410.
- the computing hardware 104 of the system 102 is configured to process the input data 108 to generate the corresponding output data efficiently and accurately with reduced processing time, such as by providing an improved regular expression performance and grouping.
- the input data 108 includes at least one of text data, image data, and video data.
- the input data 108 includes the text data.
- the input data 108 includes the image data.
- the input data 108 is used to search and group one or more sequences from the input data 108. For example, personally identifiable information (PII), such as names, phone numbers or email ids that are included in the document is identified in the input data 108, such as the text data.
- PII personally identifiable information
- the method 400 includes, storing the input data 108 in the data memory 106 of the system 102.
- the system 102 is configured to store the text data in the data memory 106 of the system 102.
- the system 102 is configured to store the image data in the data memory 106 of the system 102. Therefore, the input data 108 is stored in the data memory 106 that is processed to generate the output data.
- the method 400 includes, scanning a regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine, and to determine lengths X of the one or more sequences.
- the system 102 is configured to scan ASCII numeric variables having values between “48” (which encodes 0) and 57 (which encodes 9). In such a case, the system 102 is not required to scan the text if the text includes the character “0”, “1”, “2”, and the like.
- the system 102 is configured to scan the ASCII numeric values between “48” and “57” instead of “9”.
- the system 102 is configured to scan the ASCII alphabetical characters and other encodings such as Unicode and others to further scan the regex state machine of the system 102.
- the complexity of the regex state machine may be measured by 0(m*n) where n is the size of the text, which is reduced by the system 102, such as by reducing the value of n.
- the identification of the shared sequences is used to optimize the scanning of the text data to reduce the utilization of the resources to search for matching patterns in the input data 108 as well as the overall algorithmic complexity of the regex search.
- the regex state machine is implemented as a Boltzmann machine.
- the states are related by way of probabilities of occurrence.
- the Boltzmann machine corresponds to a neural network that uses probabilities to model the interactions between the nodes of the neural networks.
- the regex state machine that is implemented as the Boltzmann machine is used to parse the text characters of the one or more sequences of the shared characters to identify the matching occurrences of text in the input data 108 by scanning the input data 108.
- the one or more sequences of shared characters are implemented as number sequences. For example, the scanning of consecutive matches of the same characters.
- the implementation of the number sequences as the one or more sequences of the shared characters is advantageous for grouping the reduced number of states in the state machine due to shared regex sequences between an individual regex and skipping the substantial portions of the text to apply the regex matching for the text from a specific area. As a result, the overall algorithmic cost of scanning for the one or more sequences is reduced, which improves the overall processing time of the system 102.
- the system 102 is configured to group the one or more sequences into groups. Moreover, sequences of a given group share a sequence of shared characters. In addition, the sequence of shared characters is used for implementing the scanning of characters present in the input data 108. In other words, the groups of the one or more sequences are used to scan the input data 108 more efficiently and accurately with reduced processing time as the grouping of the one or more sequences eliminates the requirement of scanning the input data 108 again and again for the one or more sequences. In accordance with an embodiment, the system 102 is configured to build the groups to be able to identify more than one regex present in the input data 108.
- the 4- digit string is a shared regex that allows the identification of more than one regex, such as the regex of 7- digit string, a regex of 4-digit string.
- the system 102 is configured to process and identify the matching patterns within a large amount of the input data 108 efficiently and accurately in order to perform the required actions, such as retrieving the PII from a document and the like.
- the method 400 includes, scanning characters included in the input data 108 in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data 108.
- the system 102 allows, such as through the computing hardware 104 to scan and identify the one or more sequences of the shared characters associated with a state of the state machine with reduced memory utilization and reduced processing time.
- the method 400 includes, in an event that at least one match occurrence is found in scanning and identifying the one or more matching occurrences of the one or more sequences, executing a regex search in the input data among characters that are local to the at least one match occurrence.
- the system 102 is configured to scan the regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine and to determine lengths X of the one or more sequences. Thereafter, if at least one match occurrence is found, then, in that case, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the at least one match occurrence. For example, if a one-match occurrence is identified during the scanning of the characters, then, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the corresponding match occurrence.
- the regex search is limited to one or more regions of the input data 108 adjacent to where the match of expressions of states are identified.
- the system 102 is configured to eliminate the requirement of performing the regex search in the whole document and allows the regex search in the input data 108 among the characters that are local to the at least one match occurrence to reduce the resource utilization with an improved processing time.
- the method 400 includes, generating the output data indicative of where the regex search in executing a regex search on the input data among characters that identifies a match of expressions of states of the regex state machine to the input data 108.
- the system 102 is allowed to locate and analyze the specific areas of the input data 108, such as the matching patterns that can be further used for data mining, natural language processing, and the like.
- the method 400 is used to provide an efficient and optimized data processing for enhanced regex matching that reduces the 1 computing time or computing resources to match groups of regular expressions.
- the regular expressions are used to search and identify special patterns in a document, such as email addresses, phone numbers, uniform resource locators (URLs), and the like.
- storing the input data 108 in data memory to scan the regex for the identification of one or more sequences of the shared characters is used to determine one or more matches quickly and accurately with reduced processing time.
- the method 400 is used to detect multiple regular expressions, such as personally identifiable information (PII) detection, malware detection and the like that includes strictly specified characteristics in an efficient and optimized manner with reduced processing time and reduced utilization of resources.
- PII personally identifiable information
- the method 400 significantly improves the regular expression performance and grouping, thereby improving the functioning of the system 102 itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional methods.
- PII personally identifiable information
- steps 402 and 410 are only illustrative, and other alternatives may optionally also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
- a software product comprising computer-executable instructions that when executed by a computing hardware cause the computing hardware to execute the method 400.
- the instructions are implemented on the computer-executable instructions which include, but is not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read-Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer- readable storage medium, and/or CPU cache memory.
- the instructions are generated by a computer program, which is implemented in view of the method 400, and for use in implementing the method 400 on the computing hardware 104.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A system includes computing hardware that is configured to process input data to generate corresponding output data. The system is configured to store the input data in data memory to scan a regex state machine of the system to identify one or more sequences of shared characters associated with states of the state machine and determine lengths X of the one or more sequences. Thereafter, scan characters included in the input data to identify one or more matching occurrences of the one or more sequences of shared characters. In an event that at least one match occurrence is found in scanning execute a regex search in the input data among characters that are local to the at least one match occurrence then generates the output data indicative of where the regex search identifies a match of expressions of states of the regex state machine to the input data.
Description
SYSTEM AND METHOD FOR DATA PROCESSING FOR ENHANCED REGEX MATCHING
TECHNICAL FIELD
The present disclosure relates generally to data processing systems and computing resources management; more specifically, the present disclosure relates to a system and a method for data processing for enhanced regex matching; for example, the present disclosure relates to reducing the computing time or computing resources for matching groups of regular expressions (i.e., commonly referred to as “regex”).
BACKGROUND
Generally, regular expressions (referred to as “regex”) are used to locate and group information from documents, such as by creating a state machine and passing a text of the documents to the state machine until one or more matches are found. For example, an identification of personally identifiable information (PII) that includes a fixed template, such as a series of digits for a mobile number, an Israeli ID, for example, with nine digits only and the like from the documents. However, it is observed that the scanning of the documents to search the regex for grouping the regex outperforms searching for each regex separately because an aggregated state machine includes fewer states and transitions than the sum of states and transitions of each regex state machine. Currently, using conventional systems and algorithms for regex matching can unnecessarily increase computing time or computing resources. For example, in certain scenarios, generic algorithms for regex matching can also be an overkill for certain types of regular expressions, especially if there exist some characteristics, which can reduce the size of the state machine that is required to perform the matching. For example, regular expressions that consist of simple patterns, such as matching a single character or a specific substring, can often be handled more efficiently by specialized algorithms that are tailored to those specific patterns. Thus, the problem with “blindly” grouping regular expressions is that some groupings outperform others by orders of magnitude. Additionally, this type of blind grouping does not allow for additional optimizations to data processing (e.g., text processing). Thus, there exists a technical problem of how to configure data processing systems to increase the performance
of regex matching search or perform grouping of the regular expression without adding any additional load on computing time and resources of the data processing systems.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with configuring data processing systems using conventional techniques for searching the regular expressions and grouping the regular expressions.
SUMMARY
The present disclosure provides a system and a method of using the system including computing hardware that is configured to process input data to generate corresponding output data. The present disclosure provides a solution to the existing problem of how to increase the performance of regex matching search or perform grouping of the regular expression without adding any additional load on computing time and resources. An aim of the present disclosure is to provide a solution that overcomes, at least partially, the technical problem encountered in the prior art and provides an improved system and method of data processing for enhanced regex matching that reduces the computing time or computing resources to match groups of regular expressions. The disclosed method and the system significantly improve the regular expression performance and grouping, thereby improving the functioning of the computing device itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional systems and methods.
One or more objectives of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In one aspect, the present disclosure provides a system including computing hardware that is configured to generate corresponding output data. Moreover, the system is configured to store the input data in the data memory of the system, to scan a regex state machine of the system to identify one or more sequences of shared characters associated with states of the state machine, and to determine lengths X of the one or more sequences. Furthermore, to scan characters
included in the input data in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data, in an event that at least one match occurrence is found, to execute a regex search in the input data among characters that are local to the at least one match occurrence and to generate the output data indicative of where the regex search identifies a match of expressions of states of the regex state machine to the input data.
Optionally, by “local” is meant within one hundred (100) characters from where a match occurs, more optionally within twenty (20) characters from where the match occurs, and yet more optionally within ten (10) characters from where the match occurs.
Optionally, “local” may be computed by combining a length of a common string that is to be searched (LI), a longest prefix in the regex group (L2), and a length of a common string that is to be searched (L3), and a longest suffix (L4). For example:
(a|bc)[0-9]{8) , namely ‘a’ or ‘be’ followed by 9 digits, wherein the common string is 9 digits long, so a search is performed for a digit every 9 characters, wherein the longest prefix is 2 (‘ab’); when there is found a digit, there is a need to go back 11 characters (to match the case that the digit that was found was the last), and search forward 9 characters (to match the case that the digit that was found was the first).
The system provides an efficient and an optimized data processing for enhanced regex matching that reduces the computing time or the computing resources to match groups of regular expressions. The system is configured to automatically detect character sequences in a state machine describing a regex and further allows skipping of the majority of the text instead of scanning the whole text character by character. Thereafter, the system is configured to apply the regex match on a smaller subset of the text. The system is configured to provide encoding characteristics to perform “shorthand” searches for character sequences and to identify the same character type sequences to reduce the size of the searched text. For example, the regular expressions are used to search and identify special patterns in a document, such as email addresses, phone numbers, uniform resource locators (URLs), and the like. Moreover, storing the input data in data memory to scan the regex for the identification of one or more sequences of the shared characters is used to determine one or more matches quickly and accurately with
reduced processing time. In addition, the system may be used to detect multiple regular expressions, such as personally identifiable information (PII) detection, malware detection and the like that includes strictly specified characteristics in an efficient and optimized manner with reduced processing time and reduced utilization of resources. Furthermore, the disclosed system significantly improves the regular expression performance and grouping, thereby improving the functioning of the computing hardware itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional systems.
In an implementation form, the regex search is limited to one or more regions of the input data adjacent to where the match of expressions of states are identified.
Advantageously, the system is configured to eliminate the requirement of performing the regex search in the whole document and allows the regex search in the input data among the characters that are local to the at least one match occurrence to reduce the resource utilization with an improved processing time.
In a further implementation form, the one or more sequences of shared characters are implemented as number sequences.
In such implementation, implementation of the one or more sequences of the shared characters is implemented as the number of sequences that improve the overall algorithmic cost of scanning for the one or more sequences is reduced, which improves the overall processing time of the system.
In a further implementation, the system is configured to build the groups to be able to identify more than one regex present in the input data.
Advantageously, the system is configured to process and identify the matching patterns within a large amount of the input data efficiently and accurately in order to perform the required actions, such as retrieving the PII from a document and the like.
In another aspect, the present disclosure provides a method for (namely, a method of) using a system including computing hardware that is configured to process input data to generate corresponding output data. The method includes storing the input data in the data memory of the system, scanning a regex state machine of the system to identify one or more sequences of shared characters associated with states of the state machine, and determining lengths X of the one or more sequences, scanning characters included in the input data in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data. Moreover, in an event that at least one match occurrence is found, executing a regex search in the input data among characters that are local to the at least one match occurrence and generating the output data indicative of where the regex search identifies a match of expressions of states of the regex state machine to the input data.
Optionally by “local” is meant within one hundred (100) characters from where a match occurs, more optionally within twenty (20) characters from where the match occurs, and yet more optionally within ten (10) characters from where the match occurs.
Optionally, “local” may be computed by combining a length of a common string that is to be searched (LI), a longest prefix in the regex group (L2), and a length of a common string that is to be searched (L3), and a longest suffix (L4). For example:
(a|bc)[0-9]{8) , namely ‘a’ or ‘be’ followed by 9 digits, wherein the common string is 9 digits long, so a search is performed for a digit every 9 characters, wherein the longest prefix is 2 (‘ab’); when there is found a digit, there is a need to go back 11 characters (to match the case that the digit that was found was the last), and search forward 9 characters (to match the case that the digit that was found was the first).
The method achieves all the advantages and technical effects of the system of the present disclosure.
It is to be appreciated that all the aforementioned implementation forms may be optionally combined.
It is be noted that all devices, elements, circuitry, units, and means described in the present disclosure may be optionally implemented in the software or hardware elements or any kind of
combination thereof. All steps which are performed by the various entities described in the present disclosure, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity that performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities may be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features, and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is an illustration of a system including computing hardware that is configured to process input data to generate corresponding output data, in accordance with an embodiment of the present disclosure;
FIG. 2 is an exemplary illustration of scanning a text of a document, in accordance with an embodiment of the present disclosure;
FIG. 3 is another exemplary illustration of a scanning of a text of a document, in accordance with an embodiment of the present disclosure; and
FIG. 4 is a flowchart of a method for using a system including a computing hardware that is configured to process input data to generate corresponding output data.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the nonunderlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practising the present disclosure are also possible.
FIG. 1 is an illustration of a system including computing hardware that is configured to process input data to generate corresponding output data, in accordance with an embodiment of the present disclosure. In FIG. 1, as indicated generally by 100, a system 102 includes computing hardware 104 that is coupled to a data memory 106, wherein the system 102 is configured in use to receive input data 108.
The system 102 including the computing hardware 104 is configured to process the input data 108 to generate corresponding output data. In other words, the computing hardware 104 of the system 102 is configured to process the input data 108, to generate the corresponding output data with improved efficiency, accuracy, and reduced processing time, such as by providing an improved regular expression performance and grouping.
In accordance with an embodiment, the input data 108 includes at least one of text data, image data, and video data. In an implementation, the input data 108 includes the text data. In another implementation, the input data 108 includes the image data. The input data 108 is used to search
and group one or more sequences from the input data 108. For example, personally identifiable information (PII) is identified in the input data 108, such as the text data; such text data includes at least one of names, phone numbers or email ids (namely, email identifications) that are included in a document. Thus, the input data 108 enables the computing hardware 104 of the system 102 to process the input data 108, such as the text data or the image data or the video data to generate the corresponding output data.
In operation, the system 102 is configured to store the input data 108 in the data memory 106 of the system 102. For example, the system 102 is configured to store the text data in the data memory 106 of the system 102. In another example, the system 102 is configured to store the image data in the data memory 106 of the system 102. Thereafter, the input data 108 stored in the data memory 106 is processed to generate the output data. Furthermore, the system 102 is configured to scan a regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine and to determine lengths X of the one or more sequences. In other words, the system 102 is configured to analyze the state machine and scan for the one or more sequences of shared characters that are associated with states in the machine. The system 102 is configured to determine the length of the one or more sequences in order to scan the input data efficiently and accurately. As a result, the identification of the shared sequences is used to optimize the scanning of the text data to reduce the utilization of the resources to search for matching patterns in the input data. Moreover, in the regex state machine, the states are related by way of probabilities of occurrence. For example, the regex state machine is used to parse the text characters of the one or more sequences of the shared characters to identify the matching occurrences of text in the input data 108 by scanning the input data 108. As a result, the overall efficiency, and the accuracy of the system 102 to identify the matching patterns in the document are improved. In accordance with an embodiment, the one or more sequences of shared characters are implemented as number sequences. The implementation of the number sequences as the one or more sequences of the shared characters is advantageous for grouping the number of states in the regex state machine that further provides an efficient data processing of the regex with the shared sequences to provide an optimized pattern matching that improves the overall efficiency and processing time of the system 102. In accordance with an embodiment, the system 102 is configured to group the one or more sequences into groups. Moreover, sequences of a given group share a sequence of shared characters. In addition, the sequence of shared characters is used for implementing the
scanning of characters present in the input data 108. In other words, the groups of the one or more sequences are used to scan the input data 108 more efficiently and accurately with reduced processing time as the grouping of the one or more sequences eliminates the requirement of scanning the input data 108 again and again for the one or more sequences. In accordance with an embodiment, the system 102 is configured to build the groups to be able to identify more than one regex present in the input data 108. For example, if a 7-digit string or 2-digit characters are followed by a 4-digit string, then, in that case, the 4- digit string is a shared regex that allows the identification of more than one regex, such as a 7- digit string or a 4-digit string. In an implementation, the system 102 is configured to build the groups of more or less equal size as shown in the given below table:-
However, the overlapping of the one or more groups is required to ensure an accurate and efficient identification of the one or more regex present in the document, such as the identification of the regex that resides between the data segments. For example, an expression is split between two data segments, such as an expression
“abl2345678bb” is further shared between two data segments. In such a case, the system 102 is configured to perform overlapping of the data segments due to which the scanning of the multiple data segments can be performed. As a result, the system 102 is configured to process and identify the matching patterns within a large amount of the input data 108 efficiently and accurately in order to perform the required actions, such as retrieving the PII from the document and the like.
Furthermore, the system 102 is further configured to scan characters included in the input data 108 in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data 108. In other words, the system 102 is configured to scan every character that is included in the document to determine the lengths X of the one or more sequences. In addition, the identification of one or more sequences before performing the regex search allows the system 102 to eliminate the major portion of the text of the document that does not include, such one or more sequences that include the characters of
the determined lengths X. As a result, the system 102 allows the computing hardware 104 to scan and identify the one or more sequences of the shared characters associated with a state of the state machine with reduced memory utilization and reduced processing time. Furthermore, in an event that at least one match occurrence is found in scanning characters, there is executed a regex search in the input data 108 among characters that are local to the at least one match occurrence. Firstly, the system 102 is configured to scan the regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine and to determine lengths X of the one or more sequences. Thereafter, if at least one match occurrence is found, then, in that case, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the at least one match occurrence. For example, if a one-match occurrence is identified during the scanning of the characters, then, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the corresponding match occurrence. In accordance with an embodiment, the regex search is limited to one or more regions of the input data 108 adjacent to where the match of expressions of states are identified. As a result, the system 102 is configured to eliminate the requirement of performing the regex search in the whole document and allows the regex search in the input data 108 among the characters that are local to the at least one match occurrence to reduce the resource utilization with an improved processing time.
Furthermore, the system 102 is further configured to generate the output data indicative of where the regex search in an event that at least one match occurrence is found and identifies a match of expressions of states of the regex state machine to the input data 108. After scanning the input data 108, the regex search is performed. Thereafter, the system 102 is configured to identify the match of the expressions of the states in the regex state machine to the input data 108. For example, the identification of the personally identifiable information (PII), such as an Israeli ID, a phone number, and the like. As a result, the system 102 is allowed to locate and analyze the specific areas of the input data 108, such as the matching patterns that may be further used for data mining, natural language processing, and the like.
The system 102 provides an efficient and an optimized data processing for enhanced regex matching that reduces the computing time or computing resources to match groups of regular expressions. The system 102 is configured to automatically detect character sequences in a state machine describing a regex and further allows skipping of the majority of the text. Thereafter,
the system 102 is configured to apply the regex match on a smaller subset of the text. The system 102 is configured to provide encoding characteristics to perform “shorthand” searches for character sequences and to identify the same character type sequences to reduce the size of the searched text. For example, the regular expressions are used to search and identify special patterns in a document, such as email addresses, phone numbers, uniform resource locators (URLs), and the like. Moreover, storing the input data 108 in data memory to scan the regex for the identification of one or more sequences of the shared characters is used to determine one or more matches quickly and accurately with reduced processing time. In addition, the system 102 provided may be used to detect multiple regular expressions, such as personally identifiable information (PII) detection, malware detection and the like that includes strictly specified characteristics in an efficient and optimized manner with reduced processing time and reduced utilization of resources. Furthermore, the disclosed system (i.e., the system 102) significantly improves the regular expression performance and grouping, thereby improving the functioning of the computing hardware 104 itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional systems.
FIG. 2 is an exemplary illustration of scanning a text of a document, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements from FIG.1. With reference to FIG. 2, there is shown an illustration 200 of scanning the text of the document that includes a series of operations from 202 to 216. The computing hardware 104 of the system 102 is configured to execute the series of operations from 202 to 216, such as by scanning the text of the document.
In an implementation, the computing hardware 104 is configured to scan for a 7-digit sequence, such as from the operation 202 instead of scanning the document character by character, such as by parsing through each of the states of the regex state machine to identify the 7-digit sequence. However, in order to avoid an inefficient grouping, the system 102 is configured to limit the size of the regex state machine. Furthermore, the system 102 is configured to perform optimized scanning of the input data 108. In addition, breaking the regex into groups of more
or less equal size allows the system 102 to perform parallel programming (i.e., multi-threading or multi-processing) that improves the overall performance of the system 102.
It will be appreciated that the regex state machine is applied only when a character is found, so the document is scanned not for every character, but in steps. Moreover, the state machine is used in conjunction with searching the document, only when such a character is found, and then the characters are applied to the state machine one by one.
For example, the computing hardware 104 is configured to scan a first digit, such as at the operation 204 followed by the scanning of a second digit, such as at the operation 206, followed by the scanning of a third digit, such as at the operation 208, followed by the scanning of a fourth digit, such as at the operation 210, followed by the scanning of a fifth digit, such as at the operation 212, followed by the scanning of a sixth digit, such as at the operation 214, and finally the scanning of a seventh digit, such as at the operation 216. Thereafter, the computing hardware 104 of the system 102 is configured to apply the regex matching. Moreover, the 7- digit sequence is detected by scanning for digits (e.g., digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”) in the regex state machine to provide an optimized scanning of the text of the document with reduced processing time and resource utilization.
FIG. 3 is another exemplary illustration of scanning a text of a document, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIG.1 and FIG. 2. With reference to FIG. 3, there is shown an illustration 300 of a scanning of the text of the document. With reference to FIG. 3, there is shown an illustration 300 of scanning the text of the document that includes a series of operations from 302 to 320. The computing hardware 104 of the system 102 is configured to execute the series of operations from 302 to 320, such as by scanning the text of the document.
In an implementation, the computing hardware 104 is configured to scan the one or more sequences, such as at the operation 302. The scanning of the one or more sequences includes a first sequence that includes the scanning of the two characters, such as from “a”, “b”, “c”, “d”, and “e” followed by 4-digit numbers, such as from digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”. Furthermore, the scanning of the one or more sequences includes scanning for a second sequence includes the scanning of the 7-digit number such as “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, or “9”. In an example, the scanning of the first sequence includes the
scanning of a first character, such as at the operation 304, followed by the scanning of a second character, such as at the operation 306, followed by the scanning of a first digit, such as at the operation 314, followed by the scanning of a second digit, such as at the operation 316, followed by the scanning of a third digit, such as at the operation 318, and finally scanning of the fourth digit, such as at the operation 320. In another example, the scanning of the second sequence includes scanning of a first digit, such as at the operation 308, followed by the scanning of a second digit, such as at the operation 310, followed by the scanning of a third digit, such as at the operation 312, followed by the scanning of a fourth digit, such as at the operation 314, followed by the scanning of a fifth digit, such as at the operation 316, followed by the scanning of a sixth digit, such as at the operation 318 and finally scanning of the seventh digit, such as at the operation 320. Moreover, the computing hardware 104 is configured to perform regex matching after the identification of the first sequence or the second sequence. In addition, the traversal of states, such as “2”, “3”, “5”, “6”, and “7”, as shown in FIG. 3 are shared between both the sequences, such as the first sequence and the second sequence, which are used to reduce the size of the search text. However, an uncapitalized alphabetical characters for example are not useful since they consist of the majority of many textual documents and for human-readable text, digit sequences or sequences of capitalized characters are usually useful for reducing the text size. Therefore, in order to find the required one or more sequences, the state machine registers all the shared paths ordered by their length during the creation of the state machine. Furthermore, the required sequence is used either by preprocessing of the text, the context of the document (e.g., document in a folder containing human-readable text), or by document metadata (e.g., MS Word document) that contains information about the number of characters and their type. As a result, the scanning of the text before applying the regex matching allows the system 102 to provide optimized scanning and reduced processing time, such as by encapsulating several consecutive states that are shared (i.e., “2”, “3”, “5”, and “6”) between all paths to the final state (i.e., “7”).
FIG. 4 is a flowchart of a method for use in a system including computing hardware that is configured to process input data to generate corresponding output data, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIGs. 1, 2 and 3. With reference to FIG. 4, there is shown a flowchart of a method 400 for use in the system 102 that is configured to process input data (i.e., the input data 108 of FIG. 1) to generate corresponding output data. The method 400 includes steps 402 to 410.
There is provided the method 400 of using the system 102 including the computing hardware 104 that is configured to process the input data 108 to generate the corresponding output data. In an implementation, the computing hardware 104 of the system 102 is configured to process the input data 108 to generate the corresponding output data efficiently and accurately with reduced processing time, such as by providing an improved regular expression performance and grouping. In accordance with an embodiment, the input data 108 includes at least one of text data, image data, and video data. In an implementation, the input data 108 includes the text data. In another implementation, the input data 108 includes the image data. The input data 108 is used to search and group one or more sequences from the input data 108. For example, personally identifiable information (PII), such as names, phone numbers or email ids that are included in the document is identified in the input data 108, such as the text data.
At the step 402, the method 400 includes, storing the input data 108 in the data memory 106 of the system 102. For example, the system 102 is configured to store the text data in the data memory 106 of the system 102. In another example, the system 102 is configured to store the image data in the data memory 106 of the system 102. Therefore, the input data 108 is stored in the data memory 106 that is processed to generate the output data. At the step 404, the method 400 includes, scanning a regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine, and to determine lengths X of the one or more sequences. For example, the system 102 is configured to scan ASCII numeric variables having values between “48” (which encodes 0) and 57 (which encodes 9). In such a case, the system 102 is not required to scan the text if the text includes the character “0”, “1”, “2”, and the like. The system 102 is configured to scan the ASCII numeric values between “48” and “57” instead of “9”. Similarly, the system 102 is configured to scan the ASCII alphabetical characters and other encodings such as Unicode and others to further scan the regex state machine of the system 102. Moreover, the complexity of the regex state machine may be measured by 0(m*n) where n is the size of the text, which is reduced by the system 102, such as by reducing the value of n. As a result, the identification of the shared sequences is used to optimize the scanning of the text data to reduce the utilization of the resources to search for matching patterns in the input data 108 as well as the overall algorithmic complexity of the regex search. In accordance with an embodiment, the regex state machine is implemented as a Boltzmann machine. Moreover, the states are related by way of probabilities of occurrence. In an implementation, the Boltzmann machine corresponds to a neural network that uses
probabilities to model the interactions between the nodes of the neural networks. For example, the regex state machine that is implemented as the Boltzmann machine is used to parse the text characters of the one or more sequences of the shared characters to identify the matching occurrences of text in the input data 108 by scanning the input data 108. As a result, the overall efficiency and accuracy of the system 102 to identify the matching patterns in the documents is improved. In accordance with an embodiment, the one or more sequences of shared characters are implemented as number sequences. For example, the scanning of consecutive matches of the same characters. The implementation of the number sequences as the one or more sequences of the shared characters is advantageous for grouping the reduced number of states in the state machine due to shared regex sequences between an individual regex and skipping the substantial portions of the text to apply the regex matching for the text from a specific area. As a result, the overall algorithmic cost of scanning for the one or more sequences is reduced, which improves the overall processing time of the system 102.
In accordance with an embodiment, the system 102 is configured to group the one or more sequences into groups. Moreover, sequences of a given group share a sequence of shared characters. In addition, the sequence of shared characters is used for implementing the scanning of characters present in the input data 108. In other words, the groups of the one or more sequences are used to scan the input data 108 more efficiently and accurately with reduced processing time as the grouping of the one or more sequences eliminates the requirement of scanning the input data 108 again and again for the one or more sequences. In accordance with an embodiment, the system 102 is configured to build the groups to be able to identify more than one regex present in the input data 108. For example, if a seven 7-digit string or 2-digit characters are followed by a 4-digits string, then, in that case, the 4- digit string is a shared regex that allows the identification of more than one regex, such as the regex of 7- digit string, a regex of 4-digit string. As a result, the system 102 is configured to process and identify the matching patterns within a large amount of the input data 108 efficiently and accurately in order to perform the required actions, such as retrieving the PII from a document and the like.
At the step 406, the method 400 includes, scanning characters included in the input data 108 in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data 108. As a result, the system 102 allows, such as through the computing hardware 104 to scan and identify the one or more sequences of the
shared characters associated with a state of the state machine with reduced memory utilization and reduced processing time. Furthermore, at the step 408, the method 400 includes, in an event that at least one match occurrence is found in scanning and identifying the one or more matching occurrences of the one or more sequences, executing a regex search in the input data among characters that are local to the at least one match occurrence. Firstly, the system 102 is configured to scan the regex state machine of the system 102 to identify one or more sequences of shared characters associated with states of the state machine and to determine lengths X of the one or more sequences. Thereafter, if at least one match occurrence is found, then, in that case, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the at least one match occurrence. For example, if a one-match occurrence is identified during the scanning of the characters, then, the system 102 is configured to execute the regex search in the input data 108 among the characters that are local to the corresponding match occurrence. In accordance with an embodiment, the regex search is limited to one or more regions of the input data 108 adjacent to where the match of expressions of states are identified. As a result, the system 102 is configured to eliminate the requirement of performing the regex search in the whole document and allows the regex search in the input data 108 among the characters that are local to the at least one match occurrence to reduce the resource utilization with an improved processing time. At the step 410, the method 400 includes, generating the output data indicative of where the regex search in executing a regex search on the input data among characters that identifies a match of expressions of states of the regex state machine to the input data 108. As a result, the system 102 is allowed to locate and analyze the specific areas of the input data 108, such as the matching patterns that can be further used for data mining, natural language processing, and the like.
The method 400 is used to provide an efficient and optimized data processing for enhanced regex matching that reduces the 1 computing time or computing resources to match groups of regular expressions. For example, the regular expressions are used to search and identify special patterns in a document, such as email addresses, phone numbers, uniform resource locators (URLs), and the like. Moreover, storing the input data 108 in data memory to scan the regex for the identification of one or more sequences of the shared characters is used to determine one or more matches quickly and accurately with reduced processing time. In addition, the method 400 is used to detect multiple regular expressions, such as personally identifiable information (PII) detection, malware detection and the like that includes strictly specified characteristics in
an efficient and optimized manner with reduced processing time and reduced utilization of resources. Furthermore, the method 400 significantly improves the regular expression performance and grouping, thereby improving the functioning of the system 102 itself in terms of reducing scan complexity, to locate information in one or more documents, for example, to locate personally identifiable information (PII), while at the same time also reducing the computing time or computing resources to match groups of regular expressions by manyfold as compared to conventional methods.
The steps 402 and 410 are only illustrative, and other alternatives may optionally also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
There is further provided a software product comprising computer-executable instructions that when executed by a computing hardware cause the computing hardware to execute the method 400. In an example, the instructions are implemented on the computer-executable instructions which include, but is not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read-Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer- readable storage medium, and/or CPU cache memory. In an example, the instructions are generated by a computer program, which is implemented in view of the method 400, and for use in implementing the method 400 on the computing hardware 104.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe, and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also
be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.
Claims
1. A system (102) including computing hardware (104) that is configured to process input data (108) to generate corresponding output data, characterized in that the system (102) is configured:
(i) to store the input data (108) in data memory (106) of the system (102);
(ii) to scan a regex state machine of the system (102) to identify one or more sequences of shared characters associated with states of the state machine, and to determine lengths X of the one or more sequences;
(iii) to scan characters included in the input data (108) in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data (108);
(iv) in an event that at least one match occurrence is found in (iii), to execute a regex search in the input data (108) among characters that are local to the at least one match occurrence; and
(v) to generate the output data indicative of where the regex search in (iv) identifies a match of expressions of states of the regex state machine to the input data (108).
2. A system (102) of claim 1, wherein the regex search is limited to one or more regions of the input data (108) adj acent to where the match of expressions of states are identified.
3. A system (102) of claim 1 or 2, wherein the one or more sequences of shared characters are implemented as number sequences.
4. A system (102) of any one of the preceding claims, wherein the system (102) is configured to group the one or more sequences into groups, wherein sequences of a given group share a sequence of shared characters, and wherein the sequence of shared characters is used for implementing scanning of characters present in the input data (108).
5. A system (102) of claim 4, wherein the system (102) is configured to build the groups to be able to identify more than one regex present in the input data (108).
6. A system (102) of any one of the preceding claims, wherein the input data (108) includes at least one of: text data, image data, video data.
7. A method (400) for using a system (102) including computing hardware (104) that is configured to process input data (108) to generate corresponding output data, characterized in that the method (400) includes:
(i) storing the input data (108) in data memory (106) of the system (102);
(ii) scanning a regex state machine of the system (102) to identify one or more sequences of shared characters associated with states of the state machine, and to determine lengths X of the one or more sequences;
(iii) scanning characters included in the input data (108) in steps of X to identify one or more matching occurrences of the one or more sequences of shared characters to characters included in the input data (108);
(iv) in an event that at least one match occurrence is found in (iii), executing a regex search in the input data (108) among characters that are local to the at least one match occurrence; and
(v) generating the output data indicative of where the regex search in (iv) identifies a match of expressions of states of the regex state machine to the input data (108).
8. A method (400) of claim 7, wherein the method (400) includes limiting the regex search to one or more regions of the input data (108) adjacent to where the match of expressions of states are identified.
9. A method (400) of claim 7 or 8, wherein the method (400) includes limiting the one or more sequences of shared characters as number sequences.
10. A method (400) of any one of claims 7 to 9, wherein the method (400) includes updating at least one of the regex state machine as a function of the input data (108).
11. A method (400) of any one of claims 7 to 10, wherein the method (400) includes configuring the system (102) to group the one or more sequences into groups, wherein sequences of a given group share a sequence of shared characters, and wherein the
sequence of shared characters is used for implementing scanning of characters present in the input data (108).
12. A method (400) of claim 11, wherein the method (400) includes configuring the system to build the groups to be able to identify more than one regex present in the input data (108).
13. A method (400) of any one of claims 7 to 13, wherein the input data (108) includes at least one of: text data, image data, video data.
14. A software product including computer-executable instructions, wherein the computerexecutable instructions are executable on computing hardware (104) to implement a method (400) of any one of claims 7 to 13.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2023/062945 WO2024235443A1 (en) | 2023-05-15 | 2023-05-15 | System and method for data processing for enhanced regex matching |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2023/062945 WO2024235443A1 (en) | 2023-05-15 | 2023-05-15 | System and method for data processing for enhanced regex matching |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024235443A1 true WO2024235443A1 (en) | 2024-11-21 |
Family
ID=86605733
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2023/062945 Pending WO2024235443A1 (en) | 2023-05-15 | 2023-05-15 | System and method for data processing for enhanced regex matching |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024235443A1 (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2390797A1 (en) * | 2010-05-25 | 2011-11-30 | Huawei Technologies Co., Ltd. | Regular expression matching method and system |
| WO2020013881A1 (en) * | 2018-07-10 | 2020-01-16 | Didi Research America, Llc | Expression recognition using character skipping |
-
2023
- 2023-05-15 WO PCT/EP2023/062945 patent/WO2024235443A1/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2390797A1 (en) * | 2010-05-25 | 2011-11-30 | Huawei Technologies Co., Ltd. | Regular expression matching method and system |
| WO2020013881A1 (en) * | 2018-07-10 | 2020-01-16 | Didi Research America, Llc | Expression recognition using character skipping |
Non-Patent Citations (1)
| Title |
|---|
| FANG YU ET AL: "Fast and memory-efficient regular expression matching for deep packet inspection", ARCHITECTURE FOR NETWORKING AND COMMUNICATIONS SYSTEMS, PROCEEDINGS OF THE 2006 ACM/IEEE SYMPOSIUM, IEEE, PISCATAWAY, NJ, USA, 3 December 2006 (2006-12-03), pages 93 - 102, XP058134419, ISBN: 978-1-59593-580-9, DOI: 10.1145/1185347.1185360 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9245007B2 (en) | Dynamically detecting near-duplicate documents | |
| US10169425B2 (en) | Fast identification of complex strings in a data stream | |
| US8849841B2 (en) | Memory circuit for Aho-corasick type character recognition automaton and method of storing data in such a circuit | |
| CN101398820A (en) | Large scale key word matching method | |
| WO2004013777A1 (en) | System and method of parallel pattern matching | |
| CN105589894B (en) | Document index establishing method and device and document retrieval method and device | |
| CN113901474A (en) | A vulnerability detection method based on function-level code similarity | |
| CN113128213B (en) | Log template extraction method and device | |
| US8548979B2 (en) | Indexing for regular expressions in text-centric applications | |
| US12417311B2 (en) | Detection of personally identifiable information | |
| CN116149669A (en) | Method, device and medium for software component analysis based on binary files | |
| JP2007034777A (en) | Data retrieval device and method, and computer program | |
| CN113868485B (en) | Data loading method and apparatus | |
| CN112380445B (en) | Data query method, device, equipment and storage medium | |
| CN113946365B (en) | Page recognition method, device, computer equipment and storage medium | |
| WO2024235443A1 (en) | System and method for data processing for enhanced regex matching | |
| CN116401676A (en) | Automatic detection method and device for data loopholes, electronic equipment and storage medium | |
| CN119441547A (en) | Regular expression feature extraction and matching method, device, equipment and readable medium | |
| US11025650B2 (en) | Multi-pattern policy detection system and method | |
| US10853177B2 (en) | Performant process for salvaging renderable content from digital data sources | |
| Charalampopoulos et al. | Efficient computation of sequence mappability | |
| US12474907B2 (en) | Method and system for matching source code and binary code | |
| CN116303414B (en) | Branching of tree structures in database systems | |
| US20230385381A1 (en) | Method for detecting pattern and system thereof | |
| Dinari et al. | A method for improving graph queries processing using positional inverted index (PII) idea in search engines and parallelization techniques |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23727289 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |