US20220405274A1 - Method and system for detecting sensitive data - Google Patents
- Publication number
- US20220405274A1 (application US17/350,549)
- Authority
- US
- United States
- Prior art keywords
- sensitive data
- keywords
- data
- matching
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Definitions
- these algorithms tend to have a time complexity proportional to n+k, where n is the length of the string being searched and k is the length of the keywords, or faster (e.g., time complexity proportional to n) when the list of keywords is known in advance, so that setup for the search (e.g., construction of an Aho-Corasick automaton) can be done offline, prior to the search itself.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosed systems and methods are directed to detecting sensitive data on a computing device. This includes matching predetermined keywords in input data, to determine data in vicinities of matched keywords in the input data in which sensitive data is likely to be found, and matching predefined patterns associated with sensitive data to the data in vicinities of matched keywords to detect sensitive data. Matching the predetermined keywords occurs prior to matching the predefined patterns, and the data in vicinities of matched keywords is substantially shorter than the input data.
Description
- This is the first filing related to the disclosed technology. At the time of filing, there are no related patents or applications.
- The present disclosure relates generally to the field of computer data security, and more particularly to methods and systems for detecting sensitive data in free text and/or other digital media.
- Data can be categorized into three classes: structured data, semi-structured data, and unstructured data. The term “structured data” usually refers to database data that are clearly and strictly organized, such that it is easy to identify which row/column/table is storing which type of information. “Semi-structured data” usually refers to data that has some structure, which is either not clear or not easy to identify. Examples of semi-structured data are HTML, email, and log data. “Unstructured data” refers to data that are organized in arbitrary ways. Free text and media data are typical examples of unstructured data.
- Data may also be categorized according to sensitivity or degree of privacy or confidentiality. For purposes of explanation, this disclosure shall generally focus upon sensitive data. Sensitive data include, without limitation, data such as: financial data, including debit and credit card details; personal data, such as names, addresses, social insurance numbers, passport numbers, and/or any other information that may identify an individual; medical information, including information on medical conditions, medical insurance claims, genetic information, and other health-related information; education information, such as grades or other indications of educational performance; company-owned information, such as trade secrets and other intellectual property; and many other types of information. In general, sensitive data includes any data that, if released publicly, could cause financial loss or legal liability to the company or other organization that holds the information. In many instances, laws and regulations may dictate penalties for release of sensitive data. Such laws and regulations are becoming increasingly common throughout the world.
- Sensitive data can include structured data, semi-structured data, and unstructured data. For structured data, it is usually straightforward to identify database rows, columns, and/or tables that may contain sensitive data, and to sanitize any sensitive data (e.g., by removing, masking, or anonymizing the data) prior to providing access. For unstructured data and some semi-structured data, however, it can be more difficult to identify the sensitive data that may be included in the unstructured or semi-structured data. For these types of data, the first step in sanitizing the data is to detect the presence and positions of the sensitive data in the unstructured or semi-structured data.
- In many instances, structured and semi-structured data including sensitive data may be stored in files, often as text. For unstructured data and some portions of semi-structured data, these files may include substantial amounts of free-form text (also referred to as “free text”). In such instances, sensitive data detection will generally involve sensitive text detection.
- At present, state-of-the-art sensitive text detection tools, such as PRESIDIO, by Microsoft Corporation of Redmond, Wash., define separate “recognizers” for each type of sensitive data that is to be detected. For example, there may be a recognizer for credit card numbers, a recognizer for social insurance numbers, a recognizer for address information, etc. Such a system may include hundreds of recognizers for various types of sensitive information.
- Unfortunately, these recognizers are typically invoked one-by-one (though parallelization may be possible), making the time complexity of such systems proportional to the number of recognizers. Thus, a system detecting 50 different types of sensitive data may be ten times slower than a system detecting only five different types of sensitive data. This may be particularly problematic when hundreds of different types of sensitive data are being detected.
- To address the problems discussed above, the present disclosure applies an efficient text searching algorithm, such as the Aho-Corasick algorithm, to searching for keywords that may indicate the presence of sensitive data in free text, and provides an arrangement of a detection pipeline that provides substantial efficiency gains in automated sensitive data detection. Using the disclosed technology, it is anticipated that the detection time complexity will be independent of the number of types of sensitive data to be detected. This may substantially improve the speed, scalability, and efficiency of systems that process large amounts of text for detecting sensitive data, improving the functioning of computing devices that perform such detection.
- It will be understood that although the disclosed technology is described as applied to detection of sensitive data inside free text, other uses are also possible. Similar methods and systems could be used, for example to detect sensitive data inside of images, audio, video, and other types of digital media.
- In accordance with one aspect of the present disclosure, the technology is implemented in an apparatus including a processor, a memory coupled to the processor, and a sensitive data detector in the memory and executed by the processor. The sensitive data detector includes: a keyword matcher that matches predetermined keywords in input text, and determines text in vicinities of matched keywords in the input text in which sensitive data is likely to be found; and a pattern matcher that matches predefined patterns associated with sensitive data to the text in vicinities of matched keywords to detect sensitive data. The keyword matcher executes prior to the pattern matcher, and the text in vicinities of matched keywords is substantially shorter in length than the input text.
- In some implementations, the sensitive data detector further includes a validator that validates the sensitive data detected by the pattern matcher. In some of these implementations, the validator uses a validation function specific to a detected type of sensitive data to validate the sensitive data detected by the pattern matcher. In some implementations, the validation function includes a checksum.
- In some implementations, the keyword matcher has a time complexity that does not depend on how many predetermined keywords are to be matched. In some of these implementations, the keyword matcher uses a pre-constructed Aho-Corasick automaton configured to match the predetermined keywords in a single pass over the input text.
- In some implementations, the predefined patterns include regular expressions. In some implementations, the pattern matcher includes a regular expression matching algorithm.
- In some implementations, at least one of the keyword matcher or the pattern matcher comprises at least one of pre-processing or post-processing.
- In accordance with another aspect of the present disclosure, a method of detecting sensitive data on a computing device is provided. The method includes: matching, on the computing device, predetermined keywords in input data, to determine data in vicinities of matched keywords in the input data in which sensitive data is likely to be found; and matching, on the computing device, predefined patterns associated with sensitive data to the data in vicinities of matched keywords to detect sensitive data. Matching the predetermined keywords occurs prior to matching the predefined patterns, and the data in vicinities of matched keywords is substantially shorter than the input data.
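The two-stage ordering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the keyword table `DETECTORS`, the regular expressions in it, the `WINDOW` size, and the function names are all hypothetical, and a plain `str.find` loop stands in for the keyword matcher.

```python
import re

# Hypothetical keyword -> pattern table; names and regexes are illustrative only.
DETECTORS = {
    "credit card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "passport": re.compile(r"\b[A-Z]{2}\d{6}\b"),
}
WINDOW = 40  # assumed vicinity size, in characters, on each side of a keyword


def find_vicinities(text, keywords, window=WINDOW):
    """Stage 1: locate keywords and return (keyword, start, end) windows."""
    lowered = text.lower()
    vicinities = []
    for kw in keywords:
        pos = lowered.find(kw)
        while pos != -1:
            start = max(0, pos - window)
            end = min(len(text), pos + len(kw) + window)
            vicinities.append((kw, start, end))
            pos = lowered.find(kw, pos + 1)
    return vicinities


def detect(text):
    """Stage 2: run each keyword's pattern only inside that keyword's windows."""
    hits = []
    for kw, start, end in find_vicinities(text, DETECTORS):
        # finditer(string, pos, endpos) limits the regex scan to the window.
        for m in DETECTORS[kw].finditer(text, start, end):
            hits.append((kw, m.start(), m.group()))
    return hits
```

Because the regular expressions only ever see the short windows around matched keywords, a long digit string elsewhere in the input is never examined by the pattern stage in this sketch.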
- In some implementations, the method further includes validating the detected sensitive data. In some of these implementations, validating the detected sensitive data includes applying a validation function specific to a detected type of sensitive data to the detected sensitive data. In some implementations, applying the validation function includes calculating a checksum.
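As one possible illustration of a type-specific validation function that calculates a checksum, the sketch below uses the Luhn checksum, which is the check-digit scheme used by payment card numbers. The choice of Luhn, the `VALIDATORS` registry, and the function names are assumptions made for this example; the disclosure does not name a particular checksum here.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the ASCII digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical registry mapping a detected type to its validation function.
VALIDATORS = {"credit card": luhn_valid}


def validate(data_type: str, candidate: str) -> bool:
    """Apply the type-specific validator; accept types that have none."""
    check = VALIDATORS.get(data_type)
    return True if check is None else check(candidate)
```

A validator of this kind would run only on candidates that the pattern matching step has already produced, so its cost scales with the number of candidates rather than with the length of the input.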
- In some implementations, matching the predetermined keywords has a time complexity that does not depend on how many predetermined keywords are to be matched. In some implementations, matching the predetermined keywords includes using a pre-constructed Aho-Corasick automaton configured to match the predetermined keywords in a single pass over the input data. In some implementations, the method further includes assembling a list of keywords that frequently co-occur with sensitive data for sensitive data types that are to be detected, and constructing a single Aho-Corasick automaton that detects the keywords in the list of keywords. In some implementations, assembling the list of keywords comprises assembling the list of keywords using keywords from a plurality of virtual detectors, each virtual detector in the plurality of virtual detectors including at least one keyword associated with at least one sensitive data type.
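The setup and matching steps above can be sketched as a compact Aho-Corasick automaton in Python. This is a textbook-style sketch, not the disclosed implementation: the `VIRTUAL_DETECTORS` table, its keywords, and all function names are hypothetical. Keywords from every virtual detector are merged into one list, a single automaton is built once, and the input is then scanned in one pass.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, fail, and output tables for the given keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state on mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(kw)

    # Phase 2: breadth-first pass to set failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Scan text once; return (start_index, keyword) for every match."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kw in out[state]:
            matches.append((i - len(kw) + 1, kw))
    return matches


# Hypothetical virtual detectors: each maps a sensitive data type to keywords.
VIRTUAL_DETECTORS = {
    "credit_card": ["credit card", "visa", "card number"],
    "passport": ["passport", "passport number"],
}

# One merged keyword list and one automaton, built before any search.
KEYWORDS = sorted({kw for kws in VIRTUAL_DETECTORS.values() for kw in kws})
AUTOMATON = build_automaton(KEYWORDS)
```

With the automaton built ahead of time, a call such as `find_keywords(document_text, AUTOMATON)` visits each input character once (plus failure-link steps and reported matches), regardless of how many virtual detectors contributed keywords.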
- In some implementations, matching the predefined patterns includes matching regular expressions. In some implementations, at least one of matching the predetermined keywords or matching the predefined patterns includes applying at least one of pre-processing or post-processing. In some implementations, the input data includes at least one of text, video, audio, or images.
- The features and advantages of the present disclosure will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
- FIG. 1 is a block diagram of a computer system including an implementation of a sensitive data detector in accordance with the disclosed technology;
- FIG. 2 shows a block diagram of a conventional (prior art) system for detecting sensitive data;
- FIG. 3 shows a block diagram of a sensitive data detector in accordance with various implementations of the disclosed technology;
- FIG. 4 shows an example application of a sensitive data detector in accordance with various implementations of the disclosed technology;
- FIG. 5 shows a block diagram of a setup procedure for a sensitive data detector in accordance with various implementations of the disclosed technology; and
- FIG. 6 shows a flowchart of a method for detecting sensitive data in accordance with various implementations of the disclosed technology.
- It is to be understood that throughout the appended drawings and corresponding descriptions, like features are identified by like reference characters. Furthermore, it is also to be understood that the drawings and ensuing descriptions are intended for illustrative purposes only and that such disclosures are not intended to limit the scope of the claims.
- Various representative embodiments of the disclosed technology will be described more fully hereinafter with reference to the accompanying drawings. The present technology may, however, be embodied in many different forms and should not be construed as limited to the representative embodiments set forth herein. In the drawings, the sizes and relative sizes of layers and regions may be exaggerated for clarity. Like numerals refer to like elements throughout.
- It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first element discussed below could be termed a second element without departing from the teachings of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
- It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. By contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., “between” versus “directly between,” “adjacent” versus “directly adjacent,” etc.). Additionally, it will be understood that elements may be “coupled” or “connected” mechanically, electrically, communicatively, wirelessly, optically, and so on, depending on the type and nature of the elements that are being coupled or connected.
- The terminology used herein is only intended to describe particular representative embodiments and is not intended to be limiting of the present technology. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
- The functions of the various elements shown in the figures, including any functional block labeled as a “processor,” may be provided through the use of dedicated hardware as well as hardware capable of executing instructions, in association with appropriate software instructions. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some implementations of the present technology, the processor may be a general-purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a digital signal processor (DSP). Moreover, explicit use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a read-only memory (ROM) for storing software, a random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
- Software modules, or simply modules or units which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating the performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown. Moreover, it should be understood that a module may include, for example, but without limitation, computer program logic, computer program instructions, software, stack, firmware, hardware circuitry, or a combination thereof, which provides the required capabilities. It will further be understood that a “module” generally defines a logical grouping or organization of related software code or other elements as discussed above, associated with a defined function. Thus, one of ordinary skill in the relevant arts will understand that particular code or elements that are described as being part of a “module” may be placed in other modules in some implementations, depending on the logical organization of the software code or other elements, and that such modifications are within the scope of the disclosure as defined by the claims.
- It should also be noted that as used herein, the term “optimize” means to improve. It is not used to convey that the technology produces the objectively “best” solution, but rather that an improved solution is produced. In the context of memory access, it typically means that the efficiency or speed of memory access may be improved.
- As used herein, the term “determine” generally means to make a direct or indirect calculation, computation, decision, finding, measurement, or detection. In some cases, such a determination may be approximate. Thus, determining a value indicates that the value or an approximation of the value is directly or indirectly calculated, computed, decided upon, found, measured, detected, etc. If an item is “predetermined” it is determined at any time prior to the instant at which it is indicated to be “predetermined.”
- The present technology may be implemented as a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium (or media) storing computer-readable program instructions that, when executed by a processor, cause the processor to carry out aspects of the disclosed technology. The computer-readable storage medium may be, for example, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of these. A non-exhaustive list of more specific examples of the computer-readable storage medium includes: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), a flash memory, an optical disk, a memory stick, a floppy disk, a mechanically or visually encoded medium (e.g., a punch card or bar code), and/or any combination of these. A computer-readable storage medium, as used herein, is to be construed as being a non-transitory computer-readable medium. It is not to be construed as being a transitory signal, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- It will be understood that computer-readable program instructions can be downloaded to respective computing or processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. A network interface in each computing/processing device may receive computer-readable program instructions via the network and forward the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing or processing device.
- Computer-readable program instructions for carrying out operations of the present disclosure may be assembler instructions, machine instructions, firmware instructions, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network.
- All statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable program instructions. These computer-readable program instructions may be provided to a processor or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like.
- The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like.
- In some alternative implementations, the functions noted in flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like may occur out of the order noted in the figures. For example, two blocks shown in succession in a flowchart may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each of the functions noted in the figures, and combinations of such functions can be implemented by special-purpose hardware-based systems that perform the specified functions or acts or by combinations of special-purpose hardware and computer instructions.
- With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present disclosure.
- FIG. 1 shows an illustrative computer system 100 that includes a sensitive data detector 116, as described in greater detail below. As will be understood by one of ordinary skill in the art, a sensitive data detector, such as the sensitive data detector 116, is generally a computer program that detects the presence and locations of sensitive data within a data file or data stream. Typically, such a sensitive data detector will be configured to detect numerous types of sensitive data, such as credit card numbers, social insurance numbers, passport numbers, names, addresses, medical information, and so on. In a typical sensitive data detection system, the sensitive data detector may be configured to detect hundreds of different types of sensitive information. Following detection of such sensitive information, the computer system 100 may apply modules (not shown) for sanitizing the data by, e.g., removing, masking, and/or anonymizing the sensitive data prior to providing access to the data.
- The computer system 100 may be a multi-user server or computer, a single user computer, a laptop computer, a tablet computer, a smartphone, an embedded control system, a network gateway or router, or any other computer system currently known or later developed. As shown in FIG. 1, the computer system 100 includes one or more processors 102, a memory 110, a storage interface 120, a display interface 130, and a network interface 140. These system components are interconnected via a bus 150.
- The memory 110 may contain data 112, an operating system 114, and a sensitive data detector 116. The data 112 may be any data that serves as input to or output from any program in the computer system 100. The operating system 114 is an operating system such as MICROSOFT WINDOWS or LINUX. The sensitive data detector 116 includes a keyword matcher 117, which matches keywords (e.g., finds like or corresponding keywords) that may indicate the presence of sensitive data, a pattern matcher 118, which matches patterns associated with sensitive data, and a validator 119, which uses techniques such as checksums, which may be associated with particular types of sensitive data, to validate the identification of sensitive data. It will be understood by those of ordinary skill in the art that although the sensitive data detector 116 is shown as executing on the computer system 100, it is possible that the sensitive data detector 116 could execute on numerous computer systems, connected, e.g., by a network. Further, the keyword matcher 117, pattern matcher 118, and validator 119 may reside on different computer systems.
- The storage interface 120 is used to connect storage devices, such as the storage device 125, to the computer system 100. One type of storage device 125 is a solid-state drive, which may use an integrated circuit assembly to store data persistently. A different kind of storage device 125 is a hard drive, such as an electro-mechanical device that uses magnetic storage to store and retrieve digital data. Similarly, the storage device 125 may be an optical drive, a card reader that receives a removable memory card, such as an SD card, or a flash memory device that may be connected to the computer system 100 through, e.g., a universal serial bus (USB).
- In some implementations, the computer system 100 may use well-known virtual memory techniques that allow the programs of the computer system 100 to behave as if they have access to a large, contiguous address space instead of access to multiple, smaller storage spaces, such as the memory 110 and the storage device 125. Therefore, while the data 112, the operating system 114, and the sensitive data detector 116 are shown to reside in the memory 110, those skilled in the art will recognize that these items are not necessarily wholly contained in the memory 110 at the same time.
- The processors 102 may include one or more microprocessors and/or other integrated circuits. The processors 102 execute program instructions stored in the memory 110. When the computer system 100 starts up, the processors 102 may initially execute a boot routine and/or the program instructions that make up the operating system 114. The processors 102 may also execute instructions that make up the sensitive data detector 116.
- The display interface 130 is used to connect one or more displays 135 to the computer system 100. These displays 135, which may include, e.g., terminals, monitors, keyboards, pointer devices, touchscreens, and/or other human interface devices, provide the ability for users to interact with the computer system 100. Note, however, that although the display interface 130 is provided to support communication with one or more displays 135, the computer system 100 does not necessarily require a display 135, because all needed interaction with users may occur via the network interface 140.
- The network interface 140 is used to connect the computer system 100 to other computer systems or networked devices (not shown) via a network 160. The network interface 140 may include a combination of hardware and software that allows communicating on the network 160. The software in the network interface 140 may include software that uses one or more network protocols to communicate over the network 160. For example, the network protocols may include TCP/IP (Transmission Control Protocol/Internet Protocol). In some implementations, the network interface 140 may be an Ethernet adapter.
- It will be understood that the computer system 100 is merely an example and that the sensitive data detector according to the disclosed technology may execute on computer systems or other computing devices having different configurations.
FIG. 2 shows a block diagram of a conventional system 200 for detecting sensitive data. The system includes a pattern match module 202, a keyword match module 204, and a validation module 206.
- As can be seen, the first module that is applied to a free text stream or file is the pattern match module 202. The pattern match module 202 typically matches the text to regular expressions to find patterns indicative of sensitive data. A regular expression (or “regex”) is a sequence of characters that specifies a search pattern. Some simple examples include “a*” to match zero or more “a” characters, “a+” to match one or more “a” characters, “[ab]*c+” to match zero or more “a” or “b” characters followed by one or more “c” characters, and “generali[sz]e” to match “generalise” or “generalize”. Regular expressions have been known since the 1950s, and are commonly used in string searching. Regular expressions and algorithms for matching regular expressions will be well-understood by one of ordinary skill in the art.
- Regular expression matching is computationally expensive. One common algorithm for regular expression matching, commonly known as Thompson's construction algorithm (see Thompson, K., “Programming Techniques: Regular expression search algorithm”, Communications of the ACM, 11(6): 419-422, June 1968) takes time proportional to mn, where n is the number of characters in the stream or file, and m is the length of the regular expression being matched. If there is a fixed set of regular expressions for each sensitive data type, then either by searching for all of the regular expressions for all of the sensitive data types in one pass, or by invoking the pattern match module 202 for each sensitive data type, the overall time complexity will be linear in the number of sensitive data types.
- The keyword match module 204 searches text in the vicinity of the patterns matched by the pattern match module 202 for keywords associated with each type of sensitive data to add to the certainty that sensitive information of a particular type has been found. Each type of sensitive data may be associated with particular keywords and finding these keywords in text near the identified patterns helps to verify the presence of the type of sensitive data associated with those keywords.
- It will be understood that there are many known algorithms for searching for keywords in text. A naïve approach, which involves checking for the presence of each keyword at each position in the text being searched, will result in a time complexity of kn, where k is the total length of the keywords, and n is the length of the text being searched. If such an approach is used, then, as with the pattern match module 202, the time complexity will be linear in the number of sensitive data types, assuming that each sensitive data type is associated with a fixed set of keywords.
- There are other well-known algorithms for performing keyword searches that provide much better performance than the naïve approach. These include, for example, the Aho-Corasick algorithm (see Aho, A. and Corasick, M., “Efficient string matching: An aid to bibliographic search”, Communications of the ACM, 18 (6): 333-340, June 1975) and the Rabin-Karp algorithm (see Karp, R. and Rabin, M., “Efficient randomized pattern-matching algorithms”, IBM Journal of Research and Development, 31 (2): 249-260, March 1987), to name just a couple of the best-known such algorithms. Except in certain degenerate cases (such as when every substring is a match for the keyword in the Aho-Corasick algorithm), these algorithms tend to have a time complexity proportional to n+k, where n is the length of the string being searched, and k is the total length of the keywords, or faster (e.g., time complexity proportional to n) when the list of keywords is known in advance, such that setup for the search (e.g., construction of an Aho-Corasick automaton) can be done offline, prior to the search itself.
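As a rough illustration of the naïve approach described above, the following Python sketch checks every keyword at every position of the text, which is where the kn behaviour comes from. The function name `naive_keyword_search` and the sample keywords are hypothetical and are not taken from the disclosure.

```python
def naive_keyword_search(text, keywords):
    """Return (start, end, keyword) for every keyword occurrence.

    Every keyword is compared at every text position, so the work grows
    with (total keyword length) x (text length).
    """
    matches = []
    for start in range(len(text)):
        for kw in keywords:
            if text.startswith(kw, start):
                matches.append((start, start + len(kw), kw))
    return matches


# Hypothetical sample input, used only for illustration.
hits = naive_keyword_search("my passport and my visa", ["passport", "visa"])
```

Each additional sensitive data type adds keywords to the inner loop, which is why this approach scales linearly with the number of types.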
- In practice, however, for the conventional sensitive data detection system 200, use of these more efficient algorithms does not make much difference in overall performance. This is because the length of the text to be searched is substantially (non-trivially) shortened by the pattern match module 202, so the only text that needs to be searched is the text in the immediate vicinity of the matched patterns or regular expressions. This means that n (the length of the text to be searched) will be relatively small, so the keyword matching algorithm used in the keyword match module 204 is likely to be fast enough, even if the algorithm is not particularly efficient. Indeed, because the size of the text to be processed is likely to greatly decrease as a result of the pattern match module 202, the time taken by the pattern match module 202 will typically dominate the overall time taken by the conventional sensitive data detection system 200, with the time taken by the keyword match module 204 and the validation module 206 being practically insignificant in comparison. For this reason, conventional sensitive data detectors have typically not employed particularly efficient keyword matching algorithms.
- The validation module 206 validates the correctness of the sensitive data that has been identified by the pattern match module 202 and that is associated with appropriate keywords for a sensitive data type, as determined by the keyword match module 204. The validation module 206 applies additional validation functions, such as checksums or other tests that may be dependent on the sensitive data type, to further validate the presence of sensitive data. This reduces the false positive rate of the conventional sensitive data detection system 200.
- Because each sensitive data type may have its own validation function, it is difficult to characterize the time complexity of the validation module 206. It will be understood that there may be some sensitive data types for which there is no validation function. Additionally, it will be understood that use of the validation module 206 may be optional, and that use of the pattern match module 202 and keyword match module 204 may provide adequate results without further validation in some use cases.
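The text does not specify any particular validation function. As one illustration of a checksum-style validation, the sketch below uses the Luhn algorithm, which is commonly applied to payment card numbers. The choice of Luhn and the function name `luhn_valid` are assumptions made for this example only.

```python
def luhn_valid(candidate):
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A validation function of this kind runs only on the short candidate strings that earlier stages have already matched, so its cost is small relative to scanning the full text.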
FIG. 3 shows a block diagram of a sensitive data detector 116 in accordance with some implementations of the disclosed technology. The sensitive data detector 116 includes a keyword matcher 117, a pattern matcher 118, and an optional validator 119. The sensitive data detector 116 takes a long string of text, such as a free text stream or file that may contain sensitive data, as input and produces as output a list of positions (e.g., as a start index and end index within the input string) at which sensitive data occurs, and a corresponding list of labels, identifying the type of the sensitive data detected in each of these positions.
- The keyword matcher 117 finds keywords in the input text that may indicate the presence of sensitive data. In accordance with various implementations of the technology, the keyword matcher 117 uses an efficient keyword matching algorithm to find keywords associated with all of the sensitive data types to be detected. This identifies possible ranges or portions of the input text in the vicinity of the keywords, in which sensitive data is likely to be found (in other words, vicinity typically indicates proximity, but the degree of proximity in which sensitive data is likely to be found may be dependent upon factors such as the types of sensitive data), along with the sensitive data type associated with the keywords that were found in these ranges or portions of the input text.
- Because the keyword matcher 117 is the first stage of the sensitive data detector 116, use of an efficient keyword matching algorithm may provide a significant performance gain relative to the conventional sensitive data detection system 200 described above with reference to FIG. 2. For example, by using the Aho-Corasick algorithm with a pre-built Aho-Corasick automaton for all keywords, the keyword matcher 117 may achieve a time complexity for the keyword search that is proportional to n+o, where n is the length of the input text and o is the number of output matches. In most cases, the number of output matches is expected to be much lower than the length of the input text, making the time complexity of the algorithm essentially linear in the length of the input text. In theory, in certain degenerate cases, there could be a quadratic number of output matches, o. While theoretically possible, such cases are not expected to occur in this application of the Aho-Corasick algorithm.
- As will be understood by one of ordinary skill in the art, the Aho-Corasick algorithm is a well-known string searching algorithm, known since the 1970s, which locates all occurrences of any of a finite number of keywords in a string. While one of ordinary skill in the art would understand the Aho-Corasick algorithm and the process of constructing an Aho-Corasick automaton, a brief overview is provided here. The Aho-Corasick algorithm constructs a finite state automaton (referred to as an “Aho-Corasick automaton”) to find keywords in an input string in a single pass. The Aho-Corasick algorithm constructs this finite state machine in three stages, which are commonly referred to as the “go-to” stage, the “failure” stage, and the “output” stage. In the go-to stage, a keyword tree (referred to as a “trie”) is constructed for the set of keywords. In this context, the trie is a tree in which the root is associated with an empty string, each node in the tree represents a state in a finite state automaton, and the edges represent transitions that occur when a single character is read from the input string. The children of any node in the tree have a common prefix, namely the string associated with that node, and each leaf node represents a keyword. In the failure stage, state transitions are added for the longest suffix of the string that is also the prefix on some other node, so that input characters will not need to be scanned more than once. In the output stage, the end state for a keyword is linked to end states for other keywords that are proper suffixes of the keyword (e.g., the end state for the keyword “she” would be linked to the end state for the keyword “he”). Once such an Aho-Corasick automaton is constructed, searching an input string for all of the keywords may be performed by traversing the Aho-Corasick automaton.
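The three stages outlined above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than the implementation used by the disclosed detector; the function names `build_automaton` and `find_keywords` and the list-of-dicts state representation are assumptions chosen for brevity.

```python
from collections import deque


def build_automaton(keywords):
    """Build go-to, failure, and output tables for a set of keywords."""
    goto = [{}]      # goto[state][char] -> next state (the trie)
    fail = [0]       # fail[state] -> state for the longest proper suffix
    output = [[]]    # output[state] -> keywords that end at this state

    # Go-to stage: insert every keyword into the trie.
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(kw)

    # Failure and output stages: breadth-first traversal from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that are proper suffixes ("he" inside "she").
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_keywords(text, automaton):
    """Scan `text` once and return (start, end, keyword) for every match."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kw in output[state]:
            matches.append((i - len(kw) + 1, i + 1, kw))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_keywords("ushers", automaton)
```

Because the automaton is built once from the full keyword list, the scan itself visits each input character a bounded number of times regardless of how many keywords were supplied.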
- Advantageously, for implementations using a pre-built Aho-Corasick automaton, the time complexity of the search performed by the keyword matcher 117 does not depend on the number of keywords or on the number of sensitive data types that are being detected. Consequently, for embodiments using the Aho-Corasick algorithm, the sensitive data detector 116 will be scalable to detecting large numbers of sensitive data types without significant degradation in performance.
- In accordance with various implementations, the keyword matcher 117 is applied to the input string before the pattern matcher 118. The keyword matcher 117 is, therefore, the only part of the sensitive data detector 116 that processes the entire input string. The pattern matcher 118 processes the portions of the input string that have been identified by the keyword matcher 117 as potentially including sensitive data, i.e., those portions of the input string that are in the vicinity of keywords that are associated with sensitive data. Because this text will be much shorter than the entire input text, the computationally costly pattern matcher 118 will not have a great effect on the overall execution time of the sensitive data detector 116, which will instead be dominated by the execution time of the relatively efficient keyword matcher 117. Placing an efficient keyword matcher at the beginning of a sensitive data detection process may lead to significant efficiency gains in the overall process with no difference in accuracy of sensitive data detection.
- The pattern matcher 118 matches the portions of the input text that are in the vicinity of the keywords, as determined by the keyword matcher 117, to predefined patterns associated with the sensitive data types that are being detected. Because the keyword matcher 117 has already identified the type of sensitive data that may be in each portion of the input text, the pattern matcher 118 may apply only the patterns associated with the sensitive data type identified by the keyword matcher 117 to each portion of the text. In some embodiments, the patterns may be specified as predetermined regular expressions. The output of the pattern matcher 118 is a list of positions in the input text that match the predefined patterns, along with information on which sensitive data types were detected in these positions.
- If regular expressions are used for matching patterns in the pattern matcher 118, then, as explained above, the time complexity of the pattern matcher 118 will be proportional to mn, where n is the number of characters in the text, and m is the length of the regular expressions being matched. While this is computationally expensive (at least compared to, e.g., keyword matching), because the keyword matcher 117 has already dramatically reduced the amount of text to just the portions of the input text that are in the vicinity of the keywords, n is relatively small. Additionally, because the sensitive data type that may be found in each portion of the text has already been identified by the keyword matcher 117, fewer regular expressions are applied to each portion of text. This may dramatically reduce m. Because the length of this text is much smaller than the length of the entire input text, and the number (and therefore the total length) of regular expressions that are applied to each portion of text may also be greatly reduced, the pattern matcher 118 executes in an efficient manner, and does not take much time to execute. In practical terms, the keyword matcher 117 dominates the execution time of the sensitive data detector 116, and the execution time of the pattern matcher 118 is small in comparison.
- The validator 119 validates the correctness of the sensitive data that has been identified by the pattern matcher 118. The validator 119 applies additional validation functions, such as checksums or other tests that may be dependent on the sensitive data type, to further validate the presence of sensitive data. In some implementations, only the text that passes the validation functions will be reported as sensitive data, to reduce the false positive rate.
- Because each sensitive data type may have its own validation function, it is difficult to characterize the time complexity of the validator 119, except to say that because the amount of text to which it is applied is relatively small, it is expected that its execution time will be small in comparison to the execution time of the keyword matcher 117. It will be understood that in some implementations, there may be some sensitive data types for which there is no validation function. Additionally, it will be understood that in some implementations, use of the validator 119 may be optional.
- It will be understood that in some implementations, optional pre-processing (not shown) may be performed on the input text prior to providing the input text to the sensitive data detector 116. Additionally, there may be optional pre- and/or post-processing (not shown) for each of the keyword matcher 117, the pattern matcher 118, and the validator 119.
FIG. 4 shows an example application of a sensitive data detector in accordance with implementations of the disclosed technology. The input text 402 in this example is short, including only a few hundred characters. In practice, it is expected that the input text streams or files may include billions of characters. Additionally, for purposes of illustration, the example shown in FIG. 4 is searching for only a single sensitive data type: passport numbers. In practice, as discussed above, it is expected that the sensitive data detector 116 may search for hundreds of sensitive data types.
- As shown in the example in FIG. 4, the input text 402 is passed through the keyword matcher 430, which finds a keyword 404 (“passport”) and identifies text 406 in the vicinity of the keyword 404 for further processing. Additionally, the text 406 is marked (not shown) as potentially including sensitive data of the “passport number” sensitive data type. The text 406 includes six words before the keyword 404 and six words after the keyword 404, and includes only 91 characters (down from 388 characters in the input text 402). It will be understood that this determination of the vicinity of a keyword is used only for illustration, and there are many other ways of defining the vicinity of a keyword. Additionally, it will be recognized that the reduction in length between the input text and the text in the vicinity of the keywords will generally be far greater than is shown in this example.
- Next, the text 406 is passed through the pattern matcher 432, which searches for words that match an example pattern of a word having a first upper-case letter, followed by one or more digits. This pattern could be represented, for example, as a regular expression of the form “\<[A-Z][0-9]+\>”, where “\<” represents the start of a word and “\>” represents the end of a word. Searching for matches to this pattern in the text 406 results in matches 410 and 412.
- Next, the matches 410 and 412 are passed through the validator 434, which determines, using a validation function for passport numbers, that the match 410 is not a passport number, but the match 412 is a passport number. Accordingly, the sensitive data detector will identify the range of characters of the input text 402 containing the word “N123456” as sensitive data of the “passport number” sensitive data type. Once this sensitive data has been identified, the system may then sanitize the sensitive text, by masking, removing, anonymizing, or otherwise making the original sensitive data unavailable.
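The flow of the FIG. 4 example (keyword, then vicinity, then pattern, then validation) can be sketched in Python as follows. The input text 402 is not reproduced here, so the sample sentence is invented; the helper names and the validation rule (one letter followed by exactly six digits) are likewise hypothetical stand-ins. Python's `re` module has no `\<` or `\>`, so `\b` is used for word boundaries instead.

```python
import re

WORD = re.compile(r"\S+")
PASSPORT_PATTERN = re.compile(r"\b[A-Z][0-9]+\b")


def keyword_vicinity(text, keyword, words_each_side=6):
    """Return (start, end) character ranges around each keyword occurrence."""
    words = [(m.start(), m.end()) for m in WORD.finditer(text)]
    ranges = []
    for idx, (start, end) in enumerate(words):
        if keyword in text[start:end].lower():
            first = words[max(0, idx - words_each_side)][0]
            last = words[min(len(words) - 1, idx + words_each_side)][1]
            ranges.append((first, last))
    return ranges


def looks_like_passport(candidate):
    """Hypothetical validation rule: one letter followed by six digits."""
    return re.fullmatch(r"[A-Z][0-9]{6}", candidate) is not None


def detect_passports(text):
    """Return (start, end, label) for validated matches near the keyword."""
    results = []
    for lo, hi in keyword_vicinity(text, "passport"):
        # Only the vicinity is scanned with the regular expression.
        for m in PASSPORT_PATTERN.finditer(text, lo, hi):
            if looks_like_passport(m.group()):
                results.append((m.start(), m.end(), "passport number"))
    return results


# Invented sample text, used only for illustration.
sample = ("Please note that the traveller in seat A12 holds passport "
          "N123456 issued in Ottawa last year.")
found = detect_passports(sample)
```

In this sample, both “A12” and “N123456” match the pattern, but only the latter passes the assumed validation rule, mirroring the role the validator plays in the example above.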
FIG. 5 shows a block diagram of a setup procedure for a sensitive data detector in accordance with various implementations of the disclosed technology. This procedure may be performed “offline”, prior to use of the sensitive data detector. For example, this procedure may be performed on a system different from the system in which the sensitive data detector is used.
- For each type of sensitive data that is to be detected, a virtual detector is provided. The keywords from the virtual detectors are provided to an Aho-Corasick automaton constructor 520 that uses well-known techniques, such as are briefly described above, to construct a single Aho-Corasick automaton 522 for detecting all of the keywords. This Aho-Corasick automaton 522 is then provided to the keyword matcher (not shown in FIG. 5) to detect all keywords. The patterns 506a-506c for all the sensitive data types are gathered to form a pattern collection 530, which is provided to the pattern matcher (not shown in FIG. 5). The validation functions 508a-508c for all the sensitive data types that have a validation function are gathered to form a validation function collection 540, which is provided to the validator (not shown in FIG. 5).
FIG. 6 shows a flowchart of a method for detecting sensitive data in accordance with various implementations of the disclosed technology. The method includes a setup portion 602 and a detection portion 604.
- The setup portion 602 prepares the information on the sensitive data types that are to be detected by the detection portion 604. The setup portion 602 may be performed on a different system than the detection portion 604 and need only be executed when the set of sensitive data types to be detected changes. The input to the setup portion 602 includes the keywords, patterns, and (optional) validation functions for the sensitive data types that are to be detected. In some implementations this input may be provided as a set of “virtual detectors,” with one such virtual detector for each sensitive data type.
- At 610, a list of keywords that frequently co-occur with sensitive data is assembled for all of the sensitive data types that are to be detected. In some implementations, these keywords are used to construct a single Aho-Corasick automaton that detects all the keywords. It will be understood that in implementations that use other keyword matching algorithms, different setup for using the algorithm may be used. At 612, predefined patterns for sensitive data types that are to be detected are assembled into a collection of patterns. In some implementations, these patterns are represented by regular expressions. At 614, predefined validation functions for sensitive data types that are to be detected are assembled into a collection of validation functions. It will be understood that assembling the validation functions at 614 is optional in some implementations.
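One way to picture the setup portion is as a function that walks over the virtual detectors and gathers the three collections. The sketch below is an assumption-laden illustration: the `VirtualDetector` class, its field names, the `run_setup` function, and the sample detectors are all hypothetical, and the automaton construction itself is omitted.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VirtualDetector:
    """Hypothetical container describing one sensitive data type."""
    label: str
    keywords: list
    patterns: list
    validate: Optional[Callable[[str], bool]] = None


def run_setup(detectors):
    """Gather keywords, patterns, and validation functions (610 to 614)."""
    keyword_to_labels = {}   # 610: one keyword list covering all types
    patterns = {}            # 612: compiled patterns, grouped by type
    validators = {}          # 614: only types that have a validator
    for det in detectors:
        for kw in det.keywords:
            keyword_to_labels.setdefault(kw.lower(), set()).add(det.label)
        patterns[det.label] = [re.compile(p) for p in det.patterns]
        if det.validate is not None:
            validators[det.label] = det.validate
    return keyword_to_labels, patterns, validators


# Invented detectors, used only for illustration.
detectors = [
    VirtualDetector("passport number", ["passport"], [r"\b[A-Z][0-9]+\b"]),
    VirtualDetector("phone number", ["phone", "tel"],
                    [r"\b[0-9]{3}-[0-9]{4}\b"],
                    validate=lambda s: len(s) == 8),
]
keyword_to_labels, patterns, validators = run_setup(detectors)
```

The keys of `keyword_to_labels` are what would be handed to an automaton constructor, and the mapping back to labels lets a later stage apply only the patterns for the types whose keywords were found.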
- The detection portion 604 uses the information prepared in the setup portion 602 to detect sensitive data in input text. The inputs to the detection portion 604 include the input text in which the sensitive data is to be detected, along with the Aho-Corasick automaton (or other keyword detection setup information) and the collections of patterns and validation functions that were prepared in the setup portion 602. The outputs of the detection portion 604 include a list of positions within the input text at which sensitive data was detected, and a list of labels indicating the sensitive data types detected in each of these positions.
- At 630, a keyword match is performed to identify possible ranges of characters in the input text at which sensitive data might appear, as well as the types of sensitive data associated with the keywords found in the possible ranges. In some implementations, the keyword match is performed by running or traversing the Aho-Corasick automaton prepared in the setup portion 602. Next, at 632, the ranges of characters identified by the keyword matching are matched against patterns (from the collection of patterns assembled in the setup portion 602) associated with the sensitive data types detected in these positions. This results in a list of positions that matched the patterns. In some implementations, matching the patterns includes matching regular expressions that are used to define the patterns. At 634, validation functions are run on the positions that matched the patterns to validate that the sensitive data in these positions has been properly identified. The positions that pass the validation functions will be reported as containing sensitive data. In some implementations this validation may be optional.
- It will be understood that in addition to detecting sensitive data, the disclosed technology could be used in other applications, such as in information retrieval applications. For example, meaningful or relevant pieces of text inside larger text streams or files could be identified and/or extracted using the disclosed technology. Additionally, it will be appreciated that although the sensitive data detection is illustrated as applying to text, the disclosed technology may be applied to other types of media, such as audio, video, and/or images. For example, to detect sensitive information in audio, the audio may be converted to text, and an implementation of the disclosed technology could be used to detect sensitive and/or meaningful data in the resulting text. Similar methods could be used with other media. Additionally, in some implementations, there may be no need to reduce the media to text, since it will be understood by those of ordinary skill in the art that the disclosed technology could be modified to be directly applied to other types of data and/or media.
- It will also be understood that, although the embodiments presented herein have been described with reference to specific features and structures, various modifications and combinations may be made without departing from such disclosures. The specification and drawings are, accordingly, to be regarded simply as an illustration of the discussed implementations or embodiments and their principles as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present disclosure.
Claims (20)
1. An apparatus comprising:
a processor;
a memory coupled to the processor; and
a sensitive data detector in the memory and executed by the processor, the sensitive data detector comprising:
a keyword matcher that matches predetermined keywords in input text, and determines text in vicinities of matched keywords in the input text in which sensitive data is likely to be found; and
a pattern matcher that matches predefined patterns associated with sensitive data to the text in vicinities of matched keywords to detect sensitive data;
wherein the keyword matcher executes prior to the pattern matcher; and
wherein the text in vicinities of matched keywords is substantially shorter in length than the input text.
2. The apparatus of claim 1, wherein the sensitive data detector further comprises a validator that validates the sensitive data detected by the pattern matcher.
3. The apparatus of claim 2, wherein the validator uses a validation function specific to a detected type of sensitive data to validate the sensitive data detected by the pattern matcher.
4. The apparatus of claim 3, wherein the validation function comprises a checksum.
5. The apparatus of claim 1, wherein the keyword matcher has a time complexity that does not depend on how many predetermined keywords are to be matched.
6. The apparatus of claim 5, wherein the keyword matcher uses a pre-constructed Aho-Corasick automaton configured to match the predetermined keywords in a single pass over the input text.
7. The apparatus of claim 1, wherein the predefined patterns comprise regular expressions.
8. The apparatus of claim 7, wherein the pattern matcher comprises a regular expression matching algorithm.
9. The apparatus of claim 1, wherein at least one of the keyword matcher or the pattern matcher comprises at least one of pre-processing or post-processing.
10. A method of detecting sensitive data on a computing device, the method comprising:
matching, on the computing device, predetermined keywords in input data, to determine data in vicinities of matched keywords in the input data in which sensitive data is likely to be found; and
matching, on the computing device, predefined patterns associated with sensitive data to the data in vicinities of matched keywords to detect sensitive data;
wherein matching the predetermined keywords occurs prior to matching the predefined patterns; and
wherein the data in vicinities of matched keywords is substantially shorter than the input data.
11. The method of claim 10, further comprising validating the detected sensitive data.
12. The method of claim 11, wherein validating the detected sensitive data comprises applying a validation function specific to a detected type of sensitive data to the detected sensitive data.
13. The method of claim 12, wherein applying the validation function comprises calculating a checksum.
14. The method of claim 10, wherein matching the predetermined keywords has a time complexity that does not depend on how many predetermined keywords are to be matched.
15. The method of claim 14, wherein matching the predetermined keywords comprises using a pre-constructed Aho-Corasick automaton configured to match the predetermined keywords in a single pass over the input data.
16. The method of claim 15, further comprising assembling a list of keywords that frequently co-occur with sensitive data for sensitive data types that are to be detected, and constructing a single Aho-Corasick automaton that detects the keywords in the list of keywords.
17. The method of claim 16, wherein assembling the list of keywords comprises assembling the list of keywords using keywords from a plurality of virtual detectors, each virtual detector in the plurality of virtual detectors including at least one keyword associated with at least one sensitive data type.
18. The method of claim 10, wherein matching the predefined patterns comprises matching regular expressions.
19. The method of claim 10, wherein at least one of matching the predetermined keywords or matching the predefined patterns comprises applying at least one of pre-processing or post-processing.
20. The method of claim 10, wherein the input data comprises at least one of text, video, audio, or images.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/350,549 US11687534B2 (en) | 2021-06-17 | 2021-06-17 | Method and system for detecting sensitive data |
PCT/CN2022/090617 WO2022262447A1 (en) | 2021-06-17 | 2022-04-29 | Method and system for detecting sensitive data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/350,549 US11687534B2 (en) | 2021-06-17 | 2021-06-17 | Method and system for detecting sensitive data |
Publications (2)
Publication Number | Publication Date |
---|---|
US20220405274A1 (en) | 2022-12-22 |
US11687534B2 (en) | 2023-06-27 |
Family
ID=84489187
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/350,549 Active US11687534B2 (en) | 2021-06-17 | 2021-06-17 | Method and system for detecting sensitive data |
Country Status (2)
Country | Link |
---|---|
US (1) | US11687534B2 (en) |
WO (1) | WO2022262447A1 (en) |
Citations (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5963942A (en) * | 1996-01-16 | 1999-10-05 | Fujitsu Limited | Pattern search apparatus and method |
US6272488B1 (en) * | 1998-04-01 | 2001-08-07 | International Business Machines Corporation | Managing results of federated searches across heterogeneous datastores with a federated collection object |
US20040030692A1 (en) * | 2000-06-28 | 2004-02-12 | Thomas Leitermann | Automatic search method |
US6820075B2 (en) * | 2001-08-13 | 2004-11-16 | Xerox Corporation | Document-centric system with auto-completion |
US20040254929A1 (en) * | 2000-05-19 | 2004-12-16 | Isaac Stephen John | Request matching system and method |
US20070118391A1 (en) * | 2005-10-24 | 2007-05-24 | Capsilon Fsg, Inc. | Business Method Using The Automated Processing of Paper and Unstructured Electronic Documents |
US7370034B2 (en) * | 2003-10-15 | 2008-05-06 | Xerox Corporation | System and method for performing electronic information retrieval using keywords |
US20080147790A1 (en) * | 2005-10-24 | 2008-06-19 | Sanjeev Malaney | Systems and methods for intelligent paperless document management |
US20090063557A1 (en) * | 2004-03-18 | 2009-03-05 | Macpherson Deborah L | Context Driven Topologies |
US20090094213A1 (en) * | 2006-02-22 | 2009-04-09 | Dong Wang | Composite display method and system for search engine of same resource information based on degree of attention |
US20100076919A1 (en) * | 2006-12-08 | 2010-03-25 | Hangzhou H3C Technologies Co. Ltd. | Method and apparatus for pattern matching |
US20100153440A1 (en) * | 2001-08-13 | 2010-06-17 | Xerox Corporation | System with user directed enrichment |
US7752222B1 (en) * | 2007-07-20 | 2010-07-06 | Google Inc. | Finding text on a web page |
US7805392B1 (en) * | 2005-11-29 | 2010-09-28 | Tilera Corporation | Pattern matching in a multiprocessor environment with finite state automaton transitions based on an order of vectors in a state transition table |
US20110185077A1 (en) * | 2010-01-27 | 2011-07-28 | Interdisciplinary Center Herzliya | Multi-pattern matching in compressed communication traffic |
US20110252030A1 (en) * | 2010-04-09 | 2011-10-13 | International Business Machines Corporation | Systems, methods and computer program products for a snippet based proximal search |
US20130124534A1 (en) * | 2011-11-15 | 2013-05-16 | Long Van Dinh | Apparatus and method for information access, search, rank and retrieval |
US20140236953A1 (en) * | 2009-02-11 | 2014-08-21 | Jeffrey A. Rapaport | Methods using social topical adaptive networking system |
US8838616B2 (en) * | 2008-08-26 | 2014-09-16 | Nec Biglobe, Ltd. | Server device for creating list of general words to be excluded from search result |
US20150356173A1 (en) * | 2013-03-04 | 2015-12-10 | Mitsubishi Electric Corporation | Search device |
US20160253679A1 (en) * | 2015-02-24 | 2016-09-01 | Thomson Reuters Global Resources | Brand abuse monitoring system with infringement deteciton engine and graphical user interface |
US20170038916A1 (en) * | 2015-08-07 | 2017-02-09 | Ebay Inc. | Virtual facility platform |
US20170235799A1 (en) * | 2016-02-11 | 2017-08-17 | Lexisnexis, A Division Of Reed Elsevier Inc. | Systems and methods for searching databases using graphical user interfaces that include concept stacks |
US20170374093A1 (en) * | 2016-06-28 | 2017-12-28 | Microsoft Technology Licensing, Llc | Robust Matching for Identity Screening |
US20180232528A1 (en) * | 2017-02-13 | 2018-08-16 | Protegrity Corporation | Sensitive Data Classification |
US20180276402A1 (en) * | 2017-03-23 | 2018-09-27 | Microsoft Technology Licensing, Llc | Data loss protection for structured user content |
US20180276401A1 (en) * | 2017-03-23 | 2018-09-27 | Microsoft Technology Licensing, Llc | Configurable annotations for privacy-sensitive user content |
US20190268379A1 (en) * | 2016-03-11 | 2019-08-29 | Netskope, Inc. | Small-Footprint Endpoint Data Loss Prevention (DLP) |
US10498355B2 (en) * | 2015-01-04 | 2019-12-03 | EMC IP Holding Company LLC | Searchable, streaming text compression and decompression using a dictionary |
US10678795B2 (en) * | 2018-05-24 | 2020-06-09 | People.ai, Inc. | Systems and methods for updating multiple value data structures using a single electronic activity |
US20210058395A1 (en) * | 2018-08-08 | 2021-02-25 | Rightquestion, Llc | Protection against phishing of two-factor authentication credentials |
US11308095B1 (en) * | 2015-11-18 | 2022-04-19 | American Express Travel Related Services Company, Inc. | Systems and methods for tracking sensitive data in a big data environment |
US20220179991A1 (en) * | 2020-12-08 | 2022-06-09 | Vmware, Inc. | Automated log/event-message masking in a distributed log-analytics system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9203623B1 (en) | 2009-12-18 | 2015-12-01 | Trend Micro Incorporated | Apparatus and methods for keyword proximity matching |
CN103617251A (en) | 2013-11-28 | 2014-03-05 | 金蝶软件(中国)有限公司 | Sensitive word matching method and system |
CN109614816B (en) | 2018-11-19 | 2024-05-07 | 平安科技(深圳)有限公司 | Data desensitizing method, device and storage medium |
CN110580416A (en) | 2019-09-11 | 2019-12-17 | 国网浙江省电力有限公司信息通信分公司 | A method for automatic identification of sensitive data based on artificial intelligence |
CN113051601B (en) | 2019-12-27 | 2024-05-03 | 中移动信息技术有限公司 | Sensitive data identification method, device, equipment and medium |
- 2021-06-17: US application US17/350,549, patent US11687534B2, status: Active
- 2022-04-29: WO application PCT/CN2022/090617, publication WO2022262447A1, status: Application Filing
Patent Citations (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5963942A (en) * | 1996-01-16 | 1999-10-05 | Fujitsu Limited | Pattern search apparatus and method |
US6272488B1 (en) * | 1998-04-01 | 2001-08-07 | International Business Machines Corporation | Managing results of federated searches across heterogeneous datastores with a federated collection object |
US20040254929A1 (en) * | 2000-05-19 | 2004-12-16 | Isaac Stephen John | Request matching system and method |
US20040030692A1 (en) * | 2000-06-28 | 2004-02-12 | Thomas Leitermann | Automatic search method |
US20100153440A1 (en) * | 2001-08-13 | 2010-06-17 | Xerox Corporation | System with user directed enrichment |
US6820075B2 (en) * | 2001-08-13 | 2004-11-16 | Xerox Corporation | Document-centric system with auto-completion |
US7370034B2 (en) * | 2003-10-15 | 2008-05-06 | Xerox Corporation | System and method for performing electronic information retrieval using keywords |
US20090063557A1 (en) * | 2004-03-18 | 2009-03-05 | Macpherson Deborah L | Context Driven Topologies |
US20080147790A1 (en) * | 2005-10-24 | 2008-06-19 | Sanjeev Malaney | Systems and methods for intelligent paperless document management |
US8176004B2 (en) * | 2005-10-24 | 2012-05-08 | Capsilon Corporation | Systems and methods for intelligent paperless document management |
US20070118391A1 (en) * | 2005-10-24 | 2007-05-24 | Capsilon Fsg, Inc. | Business Method Using The Automated Processing of Paper and Unstructured Electronic Documents |
US7805392B1 (en) * | 2005-11-29 | 2010-09-28 | Tilera Corporation | Pattern matching in a multiprocessor environment with finite state automaton transitions based on an order of vectors in a state transition table |
US20090094213A1 (en) * | 2006-02-22 | 2009-04-09 | Dong Wang | Composite display method and system for search engine of same resource information based on degree of attention |
US20100076919A1 (en) * | 2006-12-08 | 2010-03-25 | Hangzhou H3C Technologies Co. Ltd. | Method and apparatus for pattern matching |
US7752222B1 (en) * | 2007-07-20 | 2010-07-06 | Google Inc. | Finding text on a web page |
US8838616B2 (en) * | 2008-08-26 | 2014-09-16 | Nec Biglobe, Ltd. | Server device for creating list of general words to be excluded from search result |
US20140236953A1 (en) * | 2009-02-11 | 2014-08-21 | Jeffrey A. Rapaport | Methods using social topical adaptive networking system |
US20110185077A1 (en) * | 2010-01-27 | 2011-07-28 | Interdisciplinary Center Herzliya | Multi-pattern matching in compressed communication traffic |
US8458354B2 (en) * | 2010-01-27 | 2013-06-04 | Interdisciplinary Center Herzliya | Multi-pattern matching in compressed communication traffic |
US20110252030A1 (en) * | 2010-04-09 | 2011-10-13 | International Business Machines Corporation | Systems, methods and computer program products for a snippet based proximal search |
US20130124534A1 (en) * | 2011-11-15 | 2013-05-16 | Long Van Dinh | Apparatus and method for information access, search, rank and retrieval |
US8965904B2 (en) * | 2011-11-15 | 2015-02-24 | Long Van Dinh | Apparatus and method for information access, search, rank and retrieval |
US20150356173A1 (en) * | 2013-03-04 | 2015-12-10 | Mitsubishi Electric Corporation | Search device |
US10498355B2 (en) * | 2015-01-04 | 2019-12-03 | EMC IP Holding Company LLC | Searchable, streaming text compression and decompression using a dictionary |
US11328307B2 (en) * | 2015-02-24 | 2022-05-10 | OpSec Online, Ltd. | Brand abuse monitoring system with infringement detection engine and graphical user interface |
US20160253679A1 (en) * | 2015-02-24 | 2016-09-01 | Thomson Reuters Global Resources | Brand abuse monitoring system with infringement deteciton engine and graphical user interface |
US20170038916A1 (en) * | 2015-08-07 | 2017-02-09 | Ebay Inc. | Virtual facility platform |
US11308095B1 (en) * | 2015-11-18 | 2022-04-19 | American Express Travel Related Services Company, Inc. | Systems and methods for tracking sensitive data in a big data environment |
US20170235799A1 (en) * | 2016-02-11 | 2017-08-17 | Lexisnexis, A Division Of Reed Elsevier Inc. | Systems and methods for searching databases using graphical user interfaces that include concept stacks |
US20200145463A1 (en) * | 2016-03-11 | 2020-05-07 | Netskope, Inc. | De Novo Sensitivity Metadata Generation for Cloud Security |
US20190268379A1 (en) * | 2016-03-11 | 2019-08-29 | Netskope, Inc. | Small-Footprint Endpoint Data Loss Prevention (DLP) |
US20170374093A1 (en) * | 2016-06-28 | 2017-12-28 | Microsoft Technology Licensing, Llc | Robust Matching for Identity Screening |
US20180232528A1 (en) * | 2017-02-13 | 2018-08-16 | Protegrity Corporation | Sensitive Data Classification |
US20180276402A1 (en) * | 2017-03-23 | 2018-09-27 | Microsoft Technology Licensing, Llc | Data loss protection for structured user content |
US10671753B2 (en) * | 2017-03-23 | 2020-06-02 | Microsoft Technology Licensing, Llc | Sensitive data loss protection for structured user content viewed in user applications |
US20190354715A1 (en) * | 2017-03-23 | 2019-11-21 | Microsoft Technology Licensing, Llc | Annotations for privacy-sensitive user content in user applications |
US20180276401A1 (en) * | 2017-03-23 | 2018-09-27 | Microsoft Technology Licensing, Llc | Configurable annotations for privacy-sensitive user content |
US10678795B2 (en) * | 2018-05-24 | 2020-06-09 | People.ai, Inc. | Systems and methods for updating multiple value data structures using a single electronic activity |
US10922345B2 (en) * | 2018-05-24 | 2021-02-16 | People.ai, Inc. | Systems and methods for filtering electronic activities by parsing current and historical electronic activities |
US11265388B2 (en) * | 2018-05-24 | 2022-03-01 | People.ai, Inc. | Systems and methods for updating confidence scores of labels based on subsequent electronic activities |
US11343337B2 (en) * | 2018-05-24 | 2022-05-24 | People.ai, Inc. | Systems and methods of determining node metrics for assigning node profiles to categories based on field-value pairs and electronic activities |
US20210058395A1 (en) * | 2018-08-08 | 2021-02-25 | Rightquestion, Llc | Protection against phishing of two-factor authentication credentials |
US20220179991A1 (en) * | 2020-12-08 | 2022-06-09 | Vmware, Inc. | Automated log/event-message masking in a distributed log-analytics system |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12174997B2 (en) * | 2021-08-05 | 2024-12-24 | Blue Prism Limited | Data obfuscation |
US20230214522A1 (en) * | 2022-01-05 | 2023-07-06 | Intuit Inc. | Automatic detection of personal identifiable information |
US20230325490A1 (en) * | 2022-04-07 | 2023-10-12 | Microsoft Technology Licensing, Llc | Agent-based extraction of cloud credentials |
US20230325489A1 (en) * | 2022-04-07 | 2023-10-12 | Microsoft Technology Licensing, Llc | Agentless extraction of cloud credentials |
US12399976B2 (en) * | 2022-04-07 | 2025-08-26 | Microsoft Technology Licensing, Llc | Agentless extraction of cloud credentials |
US20230351045A1 (en) * | 2022-04-29 | 2023-11-02 | Microsoft Technology Licensing, Llc | Scan surface reduction for sensitive information scanning |
US12158974B2 (en) * | 2022-04-29 | 2024-12-03 | Microsoft Technology Licensing, Llc | Scan surface reduction for sensitive information scanning |
CN117009596A (en) * | 2023-06-28 | 2023-11-07 | 国网冀北电力有限公司信息通信分公司 | Identification method and device for power grid sensitive data |
US20250021692A1 (en) * | 2023-07-13 | 2025-01-16 | Demostack, Inc. | Obfuscation of personally identifiable information |
KR102668190B1 (en) * | 2023-09-18 | 2024-05-23 | 주식회사 피앤피시큐어 | JavaScript engine checksum method and system for reducing personal information detection errors in personal information monitoring systems |
CN117633867A (en) * | 2023-10-26 | 2024-03-01 | 唐山启奥科技股份有限公司 | Medical image desensitizing method, device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2022262447A1 (en) | 2022-12-22 |
US11687534B2 (en) | 2023-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11687534B2 (en) | | Method and system for detecting sensitive data |
Demirkıran et al. | | An ensemble of pre-trained transformer models for imbalanced multiclass malware classification |
RU2722692C1 (en) | | Method and system for detecting malicious files in a non-isolated medium |
US10545999B2 (en) | | Building features and indexing for knowledge-based matching |
KR102790640B1 (en) | | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information |
Das et al. | | Defeating SQL injection attack in authentication security: an experimental study |
US11574053B1 (en) | | System and method for detecting malicious scripts |
Brengel et al. | | YARIX: Scalable YARA-based malware intelligence |
Li et al. | | Cobra: interaction-aware bytecode-level vulnerability detector for smart contracts |
US10719536B2 (en) | | Efficiently finding potential duplicate values in data |
Hameed et al. | | SURAGH: Syntactic Pattern Matching to Identify Ill-Formed Records. |
CN113627168A (en) | | Method, device, medium and equipment for checking component packaging conflict |
KR20250014244A (en) | | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information |
WO2016037167A1 (en) | | Identifying mathematical operators in natural language text for knowledge-based matching |
US20240264923A1 (en) | | Identifying unknown patterns in telemetry log data |
WO2024224367A1 (en) | | Clustering-based data object classification |
US20240126918A1 (en) | | Techniques for data classification and for protecting cloud environments from cybersecurity threats using data classification |
KR102499555B1 (en) | | Methods and apparatus for disarming a link in pdf or hwp |
Jain et al. | | Two timin’: Repairing smart contracts with a two-layered approach |
US20240331815A1 (en) | | Named-entity recognition of protected health information |
CN116975040A (en) | | Dangerous chemical information management method, device, equipment and readable storage medium |
KR20250014247A (en) | | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information |
CN116401676A (en) | | Automatic detection method and device for data loopholes, electronic equipment and storage medium |
US20240220612A1 (en) | | Methods and apparatus for disarming javascript in pdf or hwp |
KR102864829B1 (en) | | Apparatus for processing cyber threat information, method for processing cyber threat information, and medium for storing a program processing cyber threat information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| AS | Assignment | Owner name: HUAWEI TECHNOLOGIES CO., LTD., CHINA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SHANG, ZHIWEI;MO, ZHIJUN;SIGNING DATES FROM 20210617 TO 20210618;REEL/FRAME:062962/0855 |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |