[go: up one dir, main page]

CN103793522B - Fast signature scan - Google Patents

Fast signature scan Download PDF

Info

Publication number
CN103793522B
CN103793522B CN201410055830.4A CN201410055830A CN103793522B CN 103793522 B CN103793522 B CN 103793522B CN 201410055830 A CN201410055830 A CN 201410055830A CN 103793522 B CN103793522 B CN 103793522B
Authority
CN
China
Prior art keywords
subcode
fixed length
feature
random
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410055830.4A
Other languages
Chinese (zh)
Other versions
CN103793522A (en
Inventor
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wang Ying
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201410055830.4A priority Critical patent/CN103793522B/en
Priority to CN201711339378.4A priority patent/CN108197470A/en
Priority claimed from CN200880127748.0A external-priority patent/CN101960469B/en
Publication of CN103793522A publication Critical patent/CN103793522A/en
Application granted granted Critical
Publication of CN103793522B publication Critical patent/CN103793522B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • G06F21/564Static detection by virus signature recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/02Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227Filtering policies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method and system of scanning feature code on String field.In one embodiment, the invention provides a kind of signature scan method.Methods described includes one or more condition codes to be processed into one or more forms, the form includes one or more fingerprints of each fixed length condition code or feature subcode and one or more follow-up searching data structures, so that the number of the fingerprint of each fixed length condition code or feature subcode is equal to the step-length of signature scan operation, and the specific fixed length condition code or feature subcode can be identified on any position in any scanned String field, receive specific character string field, identify any condition code included by the specific character string field, it is included in each using scanning step to scan the fingerprint on the position of spacing, with the follow-up searching data structure is searched on the position for having one or more fingerprints matched, with any identified condition code of output.

Description

Fast signature scan
Technical field
The present invention relates to the condition code in scanning String field.
Background technology
The object of digital content(Such as file, program, webpage, Email, internet data bag, or digital picture)Can be with Include one or more String fields.One String field is a data value for typically representing word or executable code String.For example, an internet data bag can include network address, host name, Hypertext Transfer Protocol (HTTP) header, hypertext biography Defeated agreement message, e-mail attachment, Email Header and Email content.The big I of one String field is from several Individual byte is to millions of with last byte.One character string condition code can be a string of specific data values indicated completely or The expression formula of specific data value(Such as specific regular expression), the purpose is to for identifying a character string object(Such as spy Fixed computer virus or specific gene order).Condition code can be stored in a sign code database.One word levies code Database can include multiple condition codes.The big I of one character string condition code is from several bytes to thousands of individual bytes.
Character string condition code and String field are all the bit strings for including many elementary cells.One elementary cell Be it is minimum have the unit of semanteme, therefore scanning element is used as generally in signature scan technology.One elementary cell it is big It is small by application depending on.For example, the base unit of English character string is typically 8 bits(That is a byte), and a computer disease The base unit of malicious condition code is typically a byte or nybble.
The elementary cell of each condition code can be designated as being equal or different to some particular value, or specific at some In the range of(Such as in digital scope 0 to 9 or in English alphabet scope a to z).Each elementary cell can be case-insensitive It is or case sensitive.Each elementary cell can support simple logical operation(Such as " non-").In addition, each condition code can wrap Asterisk wildcard is included, for example, " * "(One random length asterisk wildcard)Or "”(One fixed length asterisk wildcard), wherein " * " represents zero or appointed Multiple any elementary cells of anticipating and "" represent any elementary cell.For each random length feature code sign, one can be entered Step indicates its random length scope.When a condition code includes random length character, the indefinite length of condition code.An if feature Code does not include random length character, and its length is fixed.
One typical signature scan process may include on all possible position in a String field, compare Corresponding condition code in the String field and condition code database.Sweep speed is generally by the size and complexity of condition code Property limitation.In addition, the energy power restriction that sweep speed is also updated one by one by condition code.
The content of the invention
The embodiments of the invention provide the method and system of the scanning feature code on String field.In general, this hair The embodiment of bright one side provides character string signature scan method, and methods described is included at one or more condition codes One or more forms are managed into, the form includes each fixed length feature subcode of each fixed length condition code or random length condition code One or more fingerprints and one or more follow-up searching data structures, it is special that one or more of fingerprints include specific fixed length Code or the j-th fingerprint of feature subcode are levied, the first elementary cell of the j-th fingerprint is in the specific fixed length condition code or spy The remainder of the step-length of position in a scanning direction divided by signature scan operation in sign subcode is equal to J, so that described The number of fingerprint is equal to the step-length of signature scan, and make it that the specific fixed length condition code or feature subcode are swept any It can be identified on any position in the String field retouched, wherein each fingerprint includes specific fixed length condition code or spy One or more fragments of subcode are levied, one or more of fragments have in the specific fixed length condition code or feature subcode From anywhere in ad-hoc location, receive a specific character string field being made up of data value, identify the specific character string Any condition code included by field, it is included in each using scanning step on the position of spacing, to scan the specific character string Field, to search one or more fingerprints of one or more condition codes, and there are one or more fingers matched On the position of line, the specific character string field is searched, to search one or more follow-up searching data structures, and described in output Any identified condition code in specific character string field.The other embodiments of the aspect of the present invention include methods described Corresponding system, device, and computer software product.
These and other embodiment alternatively includes one or more following characteristics.Each fixed length condition code or feature subcode Have multiple fingerprints, and the scanning be included in it is each using scanning step on the position of spacing, to scan the String field, with Search multiple fingerprints of one or more condition codes, including parallel search two or more fingerprints.The one of special characteristic code Each fingerprint in individual or multiple fingerprints is to indicate completely in former space or after one or more shadow spaces are projected to , the shadow space is to be arrived than the former wider array of space of Space format, the shadow space by introducing some ambiguities The former space, so that a fingerprint shadow in specific shadow space corresponds to one or more in the former space Fingerprint.
The character string signature scan method can further comprise on each position using scanning step as spacing, Specific character string field described in former spacescan, to search one or more fingerprints, and first each using scanning step as spacing Position on, in each shadow space in one or more shadow spaces of one or more fingerprints, scanning it is described specific String field, to search one or more fingerprints, then at least one shadow in one or more shadow spaces again On the position for there are one or more identified fingerprints in subspace, one in identified fingerprint is examined in the former space It is individual or two.Some ambiguities are introduced to further comprise the upper case and lower case letter in former space to become one to the former space Not case sensitive letter, all numerals from 0 to 9 in former space are become an identical numeral, and the sky former space Lattice and "-" become one or more of space or "-".
Scanning for one or more of fingerprints of one or more of condition codes can further comprise using one Individual or multiple hash tables and one or more cloth are grand(Bloom)One or more of filter is scanned.Scan for One or more of fingerprints of one or more of condition codes can further comprise using hashed value demultiplexer and One or more of one fingerprint length demultiplexer is scanned.The number of the different fingerprint length of multiple condition codes is comparable The number of the different characteristic code length of the multiple condition code is few, and the scanning can further comprise each with scanning step On the position of spacing, to scan the specific character string field, to search multiple fingerprints of the multiple condition code, including it is parallel Search two or more length identical fingerprints.One or more of fingerprints can be chosen, so that the fingerprint Length be restricted to the length of one group of one or more length being covered in one or more length ranges, to provide more points The finger scan of resolution.The length of each fingerprint can be the integral multiple of the step-length of signature scan operation.The character string Signature scan method can further comprise using one or more content adressable memorys(CAM)It is limited certainly with one or more Motivation(FA)One or more of be scanned, to search one or more of fingers of one or more of condition codes Line.
The character string signature scan method can further comprise each fingerprint of multiple fingerprints multiple condition codes One or more fingerprint sections are decomposed into, so that the number of the different fingerprint segment length of described document information is than described document information The number of different fingerprint length is few, on each position using scanning step as spacing, scans the specific character string field, with Search the multiple fingerprint section, including parallel search two or more fingerprint sections, and identified fingerprint section is synthesized Any fingerprint matching.All fingerprint sections can have an identical length, and the scanning specific character string field is to look into Look for the step-length of the signature scan operation for the integral multiple that multiple fingerprint sections can be the fingerprint segment length with one.One explanation is special Determining the fingerprint section figure of one or more possible positions of the fingerprint section in any fingerprint can be stored in the particular fingerprint section Together, for identified fingerprint section is synthesized any fingerprint matching.Illustrate the fingerprint length of one or more possible length It is any for identified fingerprint section is synthesized together with degree information can store with the first paragraph of each fingerprint or each section Fingerprint matching.One or more finite automatas(FA)It can be used to an identified fingerprint section and synthesize any fingerprint matching.
The character string signature scan method can further comprise the probability for the false positive matching for storing each fingerprint, On the position for there are one or more fingerprints matched, the probability of corresponding false positive matching is checked, and when one or more When the probability of false positive matching in the probability of individual false positive matching is insufficient to low, the specific character string field is searched, To search one or more of follow-up searching data structures.Methods described can further comprise with more corresponding to a fingerprint One or more different elementary cells of individual fixed length condition code or feature subcode build a difference searching data structure, and look into The specific character string field is looked for, to search one or more of follow-up searching data structures, including difference has searched one Multiple fixed length condition codes or feature subcode corresponding to the fingerprint of identification.Methods described can further comprise each fixed length is special Code or the one or more code film bits of encoded of feature subcode are levied, one or more of yards of film bits include being used to illustrate one Or one or more code film bits of multiple following matching conditions:Without matching, if case sensitive, logic NOT, predefine Scope, logical operation, and any range, one or more of yards of film bits include one or more following code film bits:One The code film bit of individual or multiple elementary cells or sub- elementary cell, the code film bit of one or more feature subcode sections, and one Or the code film bit of multiple fixed length condition codes or feature subcode, search the specific character string field, with search it is one or Multiple follow-up searching data structures, including to search the fixed length condition code or feature subcode by code film bits of encoded.It is described Method can further comprise the specific character string field to standardize, including encoded specific character string field decoding, The specific character string field compressed is decompressed, and one or more of unwanted string data deletion process.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, methods described bag Include and each condition code in multiple condition codes is resolved into one or more condition code sections, receive a spy being made up of data value Determine String field, scan the specific character string field, to search the multiple condition code section of the multiple condition code, bag Parallel search two or more condition code sections are included, identified condition code section are synthesized the matching of any condition code, and Export any identified condition code in the specific character string field.The other embodiments of the aspect of the present invention include System corresponding to methods described, device, and computer software product.
Specific implementation may include one or more following characteristics.Scanning for the multiple condition code section can further wrap Include use:One or more hash tables, one or more Bloom filters, one or more have hashed value multiplexing or length multiplexing Or both hash table, and it is one or more have hashed value be multiplexed or the Bloom filter of length multiplexing or both in one or It is multiple to be scanned.The condition code section of one or more possible positions of one explanation special characteristic code section in any condition code Bitmap can be stored together with the special characteristic code section, with for identified condition code section is synthesized any condition code Matching.Illustrate one or more possible feature code lengths condition code length information can further with each condition code first Section or each section of storage together, match for identified condition code section is synthesized any condition code.It is one or more Finite automata(FA)An identified condition code section can be used to and synthesize any condition code matching.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, methods described bag Include and one or more condition codes are processed into one or more forms, including each random length in one or more condition codes Condition code is decomposed into multiple fixed length feature subcodes and one or more random length feature subcodes, what reception one was made up of data value Specific character string field, identify any condition code included by the specific character string field, including the scanning specific character String field, to search multiple the fixed length condition codes or feature subcode, and there are one or more fixed length feature subcode quilts On the position of identification, fixed length feature subcode described in identified is synthesized any random length condition code, and export described specific Any identified condition code in String field.Handling one or more condition codes can be further into one or more forms Positional information including storing each fixed length feature subcode includes to static nature code composition rule database, the positional information One order and one arrive the distance range of next fixed length feature subcode, and including or do not include to each pair fixed length feature subcode it Between random length feature subcode description, and described identified fixed length feature subcode is synthesized any random length condition code Matching further comprises checking positional information and the verification or not examine each pair adjacent fixed of each identified fixed length feature subcode The random length feature subcode between long feature subcode, and one behavioral characteristics code synthetic state table of renewal.It is one or more Finite automata(FA)Identified fixed length feature can be used to and synthesize any random length condition code.The institute of the present invention Stating the other embodiments of aspect includes the system corresponding to methods described, device, and computer software product.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, methods described bag Include and select multiple fixed length condition codes, the specific character string thing for each character string object in one or more character string objects Multiple fixed length condition codes of part include j-th fixed length condition code, and the first elementary cell of the j-th fixed length condition code is described The remainder of the step-length of position in a scanning direction divided by signature scan operation in specific character string object is equal to J, so that So that the number of the fixed length condition code of the specific character string object is equal to the step-length of signature scan operation, and cause described Specific character string object can be identified on any position in any scanned String field, receive one by data It is worth the specific character string field of composition, identifies any character string object included by the specific character string field, is included in every It is individual using scanning step on the position of spacing, to scan the specific character string field, to search one or more of character strings The multiple fixed length condition code of object, including two or more fixed length condition codes of parallel scan, and described in output Any identified character string object in specific character string field.Methods described can further comprise for each character string object Multiple random length condition codes based on multigroup nonoverlapping orderly fixed length condition code are selected, in the multiple random length condition code Each random length condition code include a fixed length condition code in every group of fixed length condition code and one a pair of adjacent fixed length The random length condition code section that condition code connects, so that the number of the condition code of each character string object is equal to Sn, its Middle S is the number of the condition code of scanning step or every group of fixed length condition code and n is the group number of fixed length condition code, and identification institute State any character string object in specific character string field to further comprise scanning the specific character string field, to search one Or multiple fixed length feature subcodes of multiple random length condition codes of multiple character string objects, and having determining for one or more matchings On the position of long feature subcode, identified fixed length feature subcode is synthesized any random length condition code.
In general, the embodiment of one aspect of the present invention provides character string signature scan method, methods described bag Include and select one or more fixed length condition codes for each character string object in one or more character string objects, one Or one or more of fixed length condition codes of multiple character string objects are processed into one or more forms, the form includes every One or more fingerprints of individual fixed length condition code and one or more follow-up searching data structures, the specific character string object Multiple fingerprints include j-th fingerprint, and the first elementary cell of fingerprint described in the j-th is in the specific character string object Position divided by the remainder of the step-length of signature scan operation in a scanning direction is equal to J, so that the specific character string The fingerprint number of object is equal to the step-length of signature scan operation, and causes the specific character string object any scanned It can be identified on any position in String field, wherein each fingerprint includes the institute of the specific character string object State one or more fragments of the specific fixed length condition code in one or more fixed length condition codes, one or more of fragment tools There is the ad-hoc location from anywhere in the specific fixed length condition code, receive a specific character string being made up of data value Field, any character string object in the specific character string field is identified, be included in the position that each scanning step is spacing On, the specific character string field is scanned, to search multiple fingerprints of one or more of character string objects, wherein wrapping Parallel search two or more described fingerprints are included, and on the position for there are one or more fingerprints matched, are searched The specific character string field, to search the follow-up searching data structure, and appointing in the output specific character string field The character string object of what identification.Methods described can further comprise for the selection of each character string object it is multiple based on multigroup not overlapping Orderly fixed length condition code random length condition code, each random length condition code in the multiple random length condition code includes A fixed length condition code and a random length that a pair of adjacent fixed length condition codes are connected in every group of fixed length condition code Condition code section fingerprint, so that the number of the random length condition code of each character string object is equal to determining for each group fixed length condition code The product of the number of long condition code, and any character string object identified included by the specific character string field include sweeping The specific character string field is retouched, to search multiple fingerprints of one or more character string objects, including with parallel search two Or two or more fingerprint, and on the position for having one or more fingerprints matched, the specific character string field is searched, to look into Look for the follow-up searching data structure.The other embodiments of the aspect of the present invention include the system corresponding to methods described, Device, and computer software product.
In general, one aspect of the invention provides a character string signature scan system, and the system includes one The individual machine readable storage device including computer program product, and one or more executable computer program products and behaviour Make the processor for including following one or more modules:One can be processed into one or more condition codes including each fixed length spy Levy one or more fingerprints of each fixed length feature subcode of code or random length condition code and one or more follow-up searching datas The condition code pretreatment module of one or more forms of structure, one or more of fingerprints include specific fixed length condition code or The j-th fingerprint of feature subcode, the first elementary cell of the j-th fingerprint is in the fixed length condition code or feature subcode Position divided by the remainder of the step-length of signature scan operation in a scanning direction is equal to J, so that the number of the fingerprint Equal to the step-length of signature scan, and cause the fixed length condition code or feature subcode in any scanned String field In any position on can be identified, wherein each fingerprint include one of specific fixed length condition code or feature subcode or Multiple fragments, one or more of fragments have the spy from anywhere in the specific fixed length condition code or feature subcode Positioning is put, and one can be the form needed for one or more scannings an input String field processing being made up of data value Scanning pre-processing engine, and one can identify one or more of one or more condition codes on the input String field The finger scan engine of individual fingerprint, the identification are included on the position that each scanning step is spacing, scan the input word String field is accorded with, to search one or more of fingerprints of one or more of condition codes.The system can further comprise The fixed length of the fixed length feature subcode of fixed length condition code or random length condition code corresponding to one recognizable identified fingerprint is special Levy code Lookup engine.The system can further comprise a fixed length feature subcode included identified random length condition code Synthesize the random length condition code Lookup engine of the recognizable random length condition code of any random length condition code.The present invention's is described The other embodiments of aspect include the method corresponding to the system, device, and computer software product.
Specific implementation may include one or more following characteristics.One or more may be selected in described document information pretreatment module One or more fingerprints are simultaneously projected to one or more of shadow spaces and go to scan by shadow space.Condition code pretreatment module Each fingerprint in one or more fingerprints can be decomposed into one or more fingerprint sections of one or more length, and each The fingerprint composite signal of fingerprint section stores to a fingerprint database, and the finger scan engine and can recognize that the input character Multiple fingerprints of one or more of string field condition code, the identification are included in each position using scanning step as spacing On, the input String field is scanned, to search multiple fingerprint sections, and in the position for there are one or more identified fingerprint sections Put, identified fingerprint section is synthesized any fingerprint matching.
Described document information pretreatment module can use one or more code film bits, special to the one or more of a condition code Sign code section is encoded, and one or more condition code sections of one or more of yards of film bits and described document information are stored Together.Described document information pretreatment module can build a difference with one or more different elementary cells of multiple condition codes Divide searching data structure.Described document information pretreatment module, which can build one, includes a fingerprint database, a fixed length feature Code database, and a condition code regular data when described document information scanning system has at least one random length condition code The condition code database in storehouse.
The scanning pre-processing engine can further comprise a scanning conveyer, a projector, a character string word Segment memory, and a shadow field Memory.The scanning pre-processing engine can be handled one or more character string by block Field, the processing include conveying, decoded, normalization, and convert one or more, one or more described character string Each string chunk in block includes a finger scan region for being used for finger scan and condition code lookup, and one is used for feature The preceding condition code before finger scan region that code is searched searches region, and one be used for that condition code to search in finger scan Rear condition code after region searches region.Trizonal each region of one string chunk can be stored to one or more In individual size identical memory block, all trizonal all memory blocks individually or with one or more add-in memories blocks May make up one is searched the ring of the first memory block beginning in region by current preceding condition code, to reduce data in internal memory It is mobile.
The finger scan engine can with one or more hash tables and one or more Bloom filters one of those Or multiple detect one or more fingerprints.The finger scan engine can further comprise a finger scan controller, one Individual fingerprint hash calculator, a fingerprint finger, a fingerprint synthesizer, and a fingerprint database.The fingerprint hash Calculator can calculate multiple in order with an order hash function in the prefix fragment of multiple hashed keys not overlapped each other Multiple hashed values of hashed key.The finger scan engine may include one fingerprint section bit map of a use and fingerprint length letter One or more of breath concurrently or sequentially synthesizes multiple fingerprint sections the fingerprint synthesizer of the matching of any fingerprint.It is described Finger scan engine may include that further comprises an one or more finite automatas(FA)Fingerprint synthesizer.
One or more fingerprints of one or more length can be broken down into the fingerprint section of multiple formed objects, and by one Or multiple finger scan engines for having same scan step-length scan, each fingerprint of one or more of finger scan engines is swept Retouch the position of the one or more nonoverlapping input String fields staggeredly of engine covering so that it is one or Total scanning step of multiple finger scan engines is equal to the number of one or more of finger scan engines and single fingerprint is swept The product of the former scanning step of engine is retouched, or covers one or more partly overlapping input String fields staggeredly Position, so that total scanning step of one or more of finger scan engines is swept between the original of single finger scan engine Retouch step-length, and the product of the former scanning step of the number of one or more of finger scan engines and single finger scan engine Between.The number of the less finger scan engine of the product of scanning step and memory speed can be more than scanning step and memory speed The larger finger scan engine of product number.
Covering one or more internal memories used in one or more finger scan engines of a shorter fingerprint section can wait In or be faster than one or more internal memories used in one or more finger scan engines of one longer fingerprint section of covering, and cover Cover one used in one or more fixed length condition code Lookup engines corresponding to the shorter fingerprint of one or more average lengths Or multiple internal memories can be equal or faster than the one or more fixed length covered corresponding to the longer fingerprint of one or more average lengths One or more internal memories used in condition code Lookup engine.One or more covering is shorter than the finger scan of the fingerprint of length-specific Engine, one or more can individually or together with the Part I of corresponding fixed length condition code Lookup engine be used in the scanning Most fast internal memory in system.Multiple finger scan engines of scanning identical one or more fingerprint can be shared in a multiport Deposit.The finger scan engine can further comprise one or more content adressable memorys(CAM).
The fixed length condition code Lookup engine can further comprise a condition code search engine, and a condition code examines device, With a fixed length condition code database.Described document information search engine and described document information, which examine device, can use a condition code unit ratio Compared with the fragment for the condition code that device and a condition code section comparator carry out the one or more tape code films of comparison, with identification one or Multiple fixed length condition codes.Described document information search engine can the one or more fixed length condition codes of difference lookup or feature subcode.It is described Random length condition code Lookup engine can further comprise a condition code composition rule finger, a condition code state verification Device, a condition code composition rule database, and a condition code state form.The random length condition code Lookup engine can wrap Include a finite automata (FA).One or more engines may include one or more content adressable memorys(CAM)With one Or multiple finite automatas (FA) one or more.
Specific embodiment described in this specification can be used to realize following one or more advantages.The invention provides one to sweep Retouch the character string scanning system of the condition code in a condition code storehouse.The character string scanning system is flexible and is easily updated. Even if a character string signature scan engine is when the substantial amounts of condition code (such as hundreds of thousands condition code) of scanning, complicated condition code (as being up to several kilobytes, or with asterisk wildcard " * " and "", scope, case-insensitive, logic NOT), and a dynamic feature During code storehouse, the sweep speed (such as 100Gbps) being exceedingly fast is stilld provide.The sweep speed and condition code of the character string scanning system The size and complexity scalability in storehouse.In addition, the less internal memory of the character string scanning system demand and memory bandwidth.It is described Character string scanning system can use software or field programmable gate array(FPGA)Or application specific integrated circuit(ASIC)To implement.This Outside, the cost benefit of the character string scanning system is high, is not only suitable for high-end product, is also applied for low cost products.
One or more embodiments of the invention sees below specification and drawings.Other features and advantages of the present invention by Specification, accompanying drawing and claims are apparent.
Brief description of the drawings
Figure 1A shows the structure chart of an exemplary quick character string signature scan system;
Figure 1B shows the exemplary process diagram of a structure character string condition code database;
Fig. 1 C show the exemplary process diagram of a character string signature scan;
Fig. 2A -2C show the example data structure of a fingerprint database;
Fig. 2 D-E show the example data structure of Hash-entry block and the embodiment of a fingerprint synthesizer;
Fig. 2 F-2G show the example data structure of Hash-entry block and the reality of a corresponding fingerprint synthesizer Apply example;
Fig. 2 H-I show Hash-entry block example data structure and one corresponding to parallel fingerprint synthesizer Embodiment;
Fig. 3 A-B show the example data of the feature code group chained list searched for fixed length condition code and condition code chained list Structure;
Fig. 4 A-B show that one of the predefined global unit scope of support one of a String field is exemplary The structure chart of condition code unit comparator and a condition code section comparator;
Fig. 4 C show the structure chart of an example feature code unit comparator for supporting local condition code unit scope;
Fig. 5 A-C show the example of a selecting unit tree for being used for the lookup of fixed length condition code and condition code family chained list Property data structure;
Fig. 6 shows an example data structure for being used for the condition code regulation linked that random length condition code is searched;
Fig. 7 shows the example data structure of the condition code state-chain-table of a specific character string field;
Fig. 8 shows an example by the signified Hash-entry block of a condition code state Bloom filter or hash table Property data structure;
Fig. 9 shows an exemplary computer system.
The similar reference element similar with sign expression in the drawings.
Embodiment
General survey
The present invention is the method being scanned for a character string condition code database to a String field and is System.In one embodiment, the scan method of one " dividing and rule " is used to be scanned with multiple flow line stages.Each Random length condition code is broken down into multiple fixed length feature subcodes to scan first, and each fixed length condition code or each random length are special Each fixed length feature subcode of sign code is further broken down into multiple condition code sections to scan.In one embodiment, it is first " thick The method of " close scanning " is used to be scanned with multiple pipeline sweep phase afterwards for scanning ", to search character string condition code.Every In one scan position, one or more fingerprints of fixed length condition code or fixed length feature subcode are scanned first.Further inspection Looking into only needs to carry out on the position for having one or more fingerprint matchings.In addition, finger scan can be first in each scan position Upper scanning fingerprint shadow(The shadow space related to shadow will be described in detail below).Only there are one or more matchings On the position of fingerprint shadow, just the fingerprint need to comprehensively be checked.
The fingerprint shadow further can be first segmented to be scanned on each of the scanning positions, then again by only checking fingerprint shadow Possible position and possible fingerprint length of the subsegment in any fingerprint synthesize fingerprint shadow section.Fingerprint shadow it is complete The inspection in face only needs to carry out on the position for the fingerprint shadow for having one or more synthesis matched.In addition, fingerprint shadow section Scanning its hashed value need to be only scanned in each scan position.It is further to check that fingerprint shadow section only have Carried out on the position of the hashed value of one or more matching.In one embodiment, before the scan first for scanning character The character string condition code pretreatment of string field, and pretreated condition code is stored into condition code database, to be supplied to The scanning of multiple pipeline stage uses.
Figure 1A shows a quick character string signature scan engine 100.The scanning engine includes a condition code Pretreatment module 90, a scanning pre-processing engine 120, a finger scan engine 140, a fixed length condition code Lookup engine 160, and a random length condition code Lookup engine 180.Quick character string signature scan engine 100 is directed to one or more words Symbol string condition code database, is scanned, and may send the numbering 190 of the condition code matched back to and in institute to String field The position in String field is stated, to identify special characteristic code.In one embodiment, condition code database includes a fingerprint Database 148, a fixed length condition code database 166, and a condition code rule database 186.
Figure 1B shows the program 91 of an each character string condition code of pretreatment.In one embodiment, first by one Individual random length condition code is decomposed into multiple fixed length feature subcode and random length feature subcodes, and the pass between fixed length feature subcode The information of system is stored into condition code rule database 186(Step 92).
In order to quickly scan, step 92 export a fixed length condition code or feature subcode can be further broken down into it is more The individual fragment that can be examined by optimum ordered(Step 94).In one embodiment, first or preceding several condition code sections are especially heavy Will, can be as the fingerprint of fixed length condition code or feature subcode.The fingerprint of one character string condition code both can rapidly be swept Retouch, the probability of false negative or false positive matching can be reduced again.In one embodiment, the probability of false negative matching is zero. In one embodiment, the number of the numbers of the different fingerprint length of multiple condition codes than the different characteristic code length of multiple condition codes Lack, so as to accelerate the speed for the scan method that requirement is individually scanned to the pattern of different length.In one embodiment, When scanning step is more than an elementary cell, multiple fingerprints can be used for a character string condition code, wherein each fingerprint First elementary cell displacement of relatively previous fingerprint is one or more basic in a scanning direction for first elementary cell Unit.Scanning direction is a scan operation in a direction for inputting the movement of scan position in String field.
The fingerprint of one fixed length condition code or feature subcode can be further broken into fingerprint section, can enter one as needed Step projects to one or more shadow spaces, is then added to fingerprint database 148 and thereby into condition code database(Step Rapid 96).Because the fingerprint of different length may be by independent scan, fingerprint can further decompose into multiple fingerprint sections and carry out parallel Or serial scan.In certain embodiments, fingerprint is broken down into multiple fingerprint sections, so that the number of different fingerprint segment length For one or few more than the number of different fingerprint length, so as to accelerate to require the scanning individually scanned to the pattern of different length The speed of method.The scanning result of fingerprint section can be then synthesized to together, to detect fingerprint.
In one embodiment, in order to further improve scan efficiency and scan complex characteristic code ability, fingerprint and its Its condition code section can first be projected to one or more shadow spaces and be scanned, and then be verified again in luv space. Shadow space can select that scanning process can either be simplified and speeded up, while can also cover being possible to for fingerprint or fingerprint section Form.One shadow space can cover multiple fingerprints or the form of condition code section.For example, in order to support to distinguish size simultaneously The single character with case-insensitive is write, shadow space can be only small letter or the only space of capitalization.As a spy Example, shadow space can be exactly luv space.
The fingerprint indicated completely or any other condition code section are still to indicate completely in the shadow of any shadow space.Cause This, the fingerprint indicated completely can always scan in addition in luv space in any shadow space.In one embodiment, it is The number in the space of scanning needed for reducing, one all in one or more shadow spaces of all fingerprints indicated completely Shadow space scans so as to be scanned without fingerprint in luv space.In another embodiment, condition code database Shadow space only has one, and all fingerprints of described document information database all scan in unique shadow space.
In one embodiment, fingerprint database includes one or more Bloom filters or hash table.When former hashed key It is oversize or too expensive when being compared during finger scan, in order to further reduce the general of false positive matching and Hash collision Rate, in another embodiment, fingerprint database include one or more and have additional hashed value bit or a fingerprint length The modified Bloom filter of degree or both or hash table.
Finally, the fragment of the fixed length feature subcode of all fixed length condition codes or random length condition code is encoded and is stored in In fixed length condition code database 166, to search fixed length condition code or feature subcode(Step 98).In one embodiment, it is described Fragment can be encoded with unit membrane or subunit film, to support the specified conditions of character string condition code matching(As " do not have to With ", " equal ", " unequal ", " case-insensitive ", " case sensitive ", " within the scope of one ", " in a scope Outside ").In one embodiment, the fragment of tape code film can be then compiled into a chained list or any other lookup structure(Such as Tree etc.).In another embodiment, piece segment encode film or fixed length condition code or feature subcode code film can be compiled into lookup knot Structure, to save memory space.In another embodiment, one group of character string condition code can further utilize character string condition code Between different units carry out differential coding, to form the differential data structure that can quickly search(Such as difference tree).
Fig. 1 C demonstrate the scanning imaging system 101 of a character string condition code.One String field to be scanned first have to by Decode and be converted to(Such as with scanning pre-processing engine 120)Form required for one or more follow up scan stages(Step 102).The symbol string field passes through the shadow to its shadow and the fingerprint of one or more character string condition codes in shadow space first Son is compared and is scanned(Such as use finger scan engine 140), then any identified is referred in original fingerprint space again Schlieren(Step 104)Examined.Step 106 has checked whether a fingerprint matching.
After the scanning, without the output or one for matching or having a matching to produce the no condition code matching of an expression respectively The individual output for indicating the matching of a few condition code.In one embodiment, finger scan engine 140 can provide zero false negative Matching and a sufficiently small false positive matching probability.If without fingerprint matching, the scanning of present scanning position has been completed, can To be moved to next scan position(Step 108).If fingerprint matching, the condition code matched to a few is made into one The lookup of step(Such as use fixed length condition code Lookup engine 160), to be more clearly identified as fixed length condition code or random length feature The fixed length feature subcode of code(Step 110).
Step 112 has checked whether a fixed length condition code or the matching of feature subcode.If do not matched, Current Scan position The scanning put has been completed, and is movable to next scan position(Step 108).If one or more fixed length condition codes Match somebody with somebody, export the numbering of the fixed length condition code each matched, and terminate the scanning of present scanning position(Step 118).If making Fixed length feature subcode for a part for one or more random length condition codes is identified, and the feature subcode matched will be by dynamic Ground synthesizes(Such as use random length condition code Lookup engine 180)To detect one or more random length condition codes(Step 114).Step 116 have checked whether a random length condition code matching.If do not matched, the scanning of present scanning position has been completed, can be with It is moved to next scan position(Step 108).If one or more matches, the random length condition code each matched is exported Numbering, and terminate present scanning position scanning(Step 118).
In the preprocessing process 91 of condition code, in one embodiment, the probability quilt of the false positive matching of each fingerprint It is stored in fingerprint database 148.If step 106 has a fingerprint matching, the false positive matching of the fingerprint of the matching it is general Rate will be checked.If the probability of the false positive matching is sufficiently low(Such as less than the threshold value specified), present scanning position sweeps Retouch process to have completed, next scan position can be moved to(Step 108).In another embodiment, in all databases Fingerprint false positive matching probability it is all sufficiently low, therefore be not required to store and check fingerprint false positive matching probability.When The scanning process of preceding scan position has been completed after finger scan, can be moved to next scan position(Step 108).
In a step 102, scanning pre-processing engine 120 first decodes the String field, standardizes, and be transformed to With the condition code identical form in condition code database.In one embodiment, character string signature scan is in whole word Carried out on symbol string field.However, in another embodiment, limitation or low latency due to the memory space of some systems will Ask, it is impossible to store whole String field.Therefore, in step 102, the String field can be broken down into several pre- Determine the string chunk of size.Character string signature scan is carried out on the character string block of each predefined size.
After a string chunk is loaded, the string chunk will be decoded, normalization, and sweep phase institute after being transformed to The different-format needed.In one embodiment, decoding and normalisation process can support different compressed format(Such as LZS, PKZip, and gzip), different coding standard(Such as UU codings, MIME codings, HTML, and XML), and delete random " counter-scanning " Junk data.
In one embodiment, decoded String field will further be projected to one or more condition code data The shadow space that storehouse requires, to support complicated condition code.For example, decoded String field is converted into full small letter (Such as a shadow space), to support the character string signature scan of case-insensitive.Character string signature scan can be first Decode in full small letter and carried out on String field, then again with the former case sensitive String field of decoding and full small letter String field has been decoded to be verified.
At step 104, finger scan can identify that its shadow is the fingerprint indicated completely first.It is a large amount of in order to quickly scan And complicated condition code, in one embodiment, finger scan engine 140 can use one or more hash tables or the grand filtering of cloth Device scans multiple elementary cells simultaneously.In one embodiment, finger scan engine 140 can be multiplexed and fingerprint length with hashed value The probability for being multiplexed to improve the service efficiency of holder and reduce false positive matching and fingerprint collision.Hashed value is multiplexed and fingerprint length The matching of zero false negative can ensured by spending multiplexing(Miss a condition code matching)While, further reduce false positive matching (I.e. wrong condition code matching)Probability.
Step 110 scans fixed length condition code.The fixed length feature subcode of fixed length condition code or random length condition code can be described The fixed length signature scan stage identifies.Fixed length signature scan only need to be when at least one fingerprint matching during finger scan Just perform.Fixed length condition code or feature subcode corresponding to the fingerprint matched can one by one be matched or with other lookups Structure(Such as set)Matched.In one embodiment, the tape code film of elementary cell or sub- elementary cell can relatively be used for Support the specified conditions of character string condition code matching(Such as " do not have to matching ", " equal ", " unequal ", " case-insensitive ", " case sensitive ", " within the scope of one ", " outside a scope ").
Step 114 scans random length condition code.In one embodiment, random length signature scan only need to have one in scanning Just performed during the random length condition code of individual or multiple random length characters.The fixed length feature subcode of random length condition code is in fixed length feature Code sweep phase is identified, and identified fixed length feature subcode can be dynamically connected to one in the random length signature scan stage Rise, to synthesize one or more former random length condition codes.The synthesis can be dynamic with a static monitor rule list and one State synthetic state table is realized.The composition rule table indicates the rule that fixed length feature subcode is synthesized to random length condition code, The synthetic state table then safeguards current synthetic state according to the composition rule table.
The pretreatment of condition code database
In one embodiment, it is special before the String field is scanned in order to improve sweep speed and memory efficient Sign code pretreatment module 90 pre-processes to condition code.Before condition code database is stored in, condition code pretreatment module 90 First condition code can be decomposed, converted, and be encoded into one or more forms.In one embodiment, condition code pretreatment mould Block 90 can build and safeguard a fingerprint database 148, a fixed length condition code database 166, and a condition code rule Database 186.
When condition code database has one or more condition codes to contain one or more random length subcodes(Such as represent more than zero " * " of arbitrary elementary cell or represent " (bc) { 3-6 } " that " bc " repeats 3 to 6 times), each such random length condition code First multiple fixed length feature subcodes can be broken down into random length feature subcode.If for example, a condition code is " subcode 1* Code 2* subcodes 3 ", wherein subcode 1, subcode 2, and subcode 3 are all free from the fixed length subcode of random length character, and described document information can be with It is broken down into subcode 1, subcode 2, subcode 3.Each " * " in subcode 1* subcode 2* subcodes 3 can be by a random length subcode Substitution.In one embodiment, each fixed length feature subcode can be first scanned independently, and then be synthesized original again Random length condition code.
In one embodiment, condition code rule database 186 can use the positional information of each fixed length feature subcode(As follows Sequence, last subcode mark, to the distance or distance range of next fixed length subcode)To build, for synthesis fixed length feature Code.In one embodiment, it is described when the random length subcode between two continuous fixed length feature subcodes is not " do not have to matching " Condition code rule database 186 can be further containing the description to the random length subcode, for synthesizing fixed length feature subcode. In another embodiment, one or more finite automatas can be built with the fixed length feature subcode and random length feature subcode (FA), for synthesizing fixed length feature subcode, wherein each fixed length feature subcode is as a generally incoming symbol.
In one embodiment, a fixed length condition code or feature subcode will be further broken down into multiple can use most The fragment that good order is searched(These fragments include the fingerprint of described document information or feature subcode).The multiple Section can have different sizes or identical size.In order to prevent false negative from matching or not miss any one condition code own The set of the fragment is equal to original long condition code.During signature scan, with the increasing of the number of the fragment of matching Add, false positive matching value will reduce(The confidence level increase matched).The process of scanning is terminated at first unmatched Section or the fragment of last matching.In one embodiment, the selection of the fragment of condition code, which can allow, terminates or knows without matching The condition code matching of other zero false positive matching occurs as early as possible.
In one embodiment, the fingerprint includes multiple fragments.In another embodiment, the fingerprint only has one Fragment, and triple { fragment, length, dislocation } is encoded as, wherein fragment is first of character string condition code scanned Fragment is the fingerprint of described document information, and length is the length of the fingerprint, misplace as the fingerprint in a fixed length condition code or Dislocation in feature subcode.Particular fingerprint is the particular patch of the fixed length feature subcode of a fixed length condition code or random length condition code Section.
In one embodiment, the shadow space can be chosen simplify and accelerate the mistake of described document information scanning Journey, while the form of a variety of fingerprints or feature chip segment can be covered.In the ideal case, the phantom value can be direct As a hashed key.For example, in order to support case sensitive and case-insensitive unit simultaneously, the shadow space can With the space for being full small letter or capitalizing entirely.For example, in order to scan the drivers license number for being followed by 7 numerals by a letter and being formed, Wherein each letter and number can further specify that can use a code for the scope of arbitrary letter and number, shadow space Or any one letter(Such as " a ")Instead of all letters, and with another code or any one numeral(Such as " 0 ")Instead of all Numeral.For example, in order to scan the Social Security Number being made up of three groups of three bit digitals separated by space or "-" (SSN), wherein each numeral can further be designated as arbitrary digital scope, shadow space can use a code or any One numeral(Such as " 0 ")Space and "-" are replaced instead of all numerals, and with another code or space or "-".As a spy Example, the shadow space can be exactly former space.
In one embodiment, for fingerprint after shadow space has been scanned, the checking of fingerprint can detect fingerprint Carried out at once in former space after shadow.In another embodiment, the checking can formerly verify the part of other fragments Or carried out again after whole.If the fingerprint is covered by other fragments completely, it is not required to verify.
The selection of fingerprint should accelerate the speed of the finger scan, at the same also provide minimum after finger scan False positive matching probability.In one embodiment, the fingerprint can be arbitrary dimension and appointing in described document information Anticipate on position.In another embodiment, in order to meet the requirement of system, the size of the fingerprint or in described document information Position is restricted.For example, in order to meet the delay requirement of system, the dislocation of the fingerprint is no more than certain particular value.
In one embodiment, fingerprint can select to next or multiple conditions:
1)The shadow of the fingerprint does not have asterisk wildcard or scope, with by compared with short scan,
2)The probability very little that the fingerprint occurs in the String field to be scanned,
3)The number of the shared fingerprint of multiple condition codes is small as much as possible, and
4)There is the number of the fingerprint of an identical fingerprint section small as much as possible.
The condition of other selections can increase according to the requirement of system.In common network application and non-network application In, generally all or most numeric string condition code is all at least considerably long containing one section, is projecting to a selected shadow Without asterisk wildcard or the fragment of scope after in space.In one embodiment, first alternative condition is a necessary condition. In another embodiment, first alternative condition can be further limited to each fingerprint of all fingerprints at least in a shadow It is to indicate completely in subspace.Condition code without the fingerprint for meeting alternative condition, can be extended to it is multiple containing meet choosing Select the condition code of the fingerprint of condition, or be not required to the different scanning mode that condition code extends with one and scan.
The fingerprint can be selected by checking the fragment of all condition codes for meeting first choice condition.It is other The parameter of fingerprint can also be contemplated when fingerprint is selected.According to the second alternative condition, select described to be scanned String field occur probability very little condition code section as fingerprint, it is possible to reduce the probability of false positive fingerprint matching.This Outside, allow the number for the fingerprint for having an identical fingerprint section small as much as possible, can further reduce false positive fingerprint matching Probability.Although the length of the fingerprint can be less than 8 elementary cells or more than 32 elementary cells, 8 to 32 are that typically in Between elementary cell.
Because described document information can be very long(Such as hundreds of or thousands of individual elementary cells), the length of fingerprint may be also a lot. However, in one embodiment, the fingerprint of different length will be scanned individually, so as to which sweep speed is slower.In one embodiment, In order to reduce the complexity of scanning, the number of fingerprint length can be limited according to particular system requirement and system architecture, it is such as few In 16.In one embodiment, the length of fingerprint can be selected from a predetermined length table.In addition, the length of fingerprint can With according to system requirements and system architecture, by exponential increasing relation(Such as 2,4,8,16, and 32), linear increment relation(Such as 4 times Number:4,8,12,16,20,24,28, and 32), or another relation(Such as 2,3,5,8,13,21,34)To select.
In one embodiment, the fingerprint of a condition code can be selected with an algorithm.For example, following algorithm can be with For selecting the fingerprint of a fixed length condition code or feature subcode(Assuming that scanning step is equal to 1, the length of fingerprint is from being most as short as most Length is fixed as l0, l1, l2..., lm-1, lm, finger scan, which is segmented, to be carried out, and the shadow space of finger scan has given):
1. it in shadow space is the condition code subsegment that indicates completely to find all.
2. couple each l for being longer thanmDescribed subsegment, find all length and be equal to lmSubsegment.
3. pair each length is equal to lmSubsegment, write down the subsegment be chosen as in all condition codes fingerprint time Number NcWith the times N for having identical the first fingerprint section with other fingerprintss, and with one it is based on Nc, Ns, and lmCost function calculate Value at cost.
4. it is equal to l with length respectivelym-1..., l2, l1, and l0Subsegment repeat step 2 and 3.
5. from step 2 to 4, the subsegment for finding out minimum cost is a fingerprint.
Above-mentioned steps are relevant with the processing sequence of condition code.Several random process orders can be used for finding not as needed Same fingerprint.In one embodiment, value at cost is equal to handle(m-i),Nc, and NsLinked up from highest order to lowest order, wherein I=0,1,2 ..., m and i are the length of fingerprint.In one embodiment, if having found the fingerprint of length-specific, it is not necessary to again All shorter subsegments are selected.
In another embodiment, each mobile scanning step of finger scan engine 140.In each scan position On, finger scan engine 140 serially or parallelly scans the fingerprint of different length.Therefore, sweep speed and scanning step(Connect Continue the number of the elementary cell between two scan positions)It is directly proportional.In one embodiment, in order to improve sweep speed, refer to Line scanning can scan multiple elementary cells, rather than an elementary cell simultaneously.In order to ensure zero false negative matches, each word Symbol string condition code is equal to the scanning step with the number of multiple fingerprints and the fingerprint.In other words, a fixed length condition code Or condition code database can be put into multiple fingerprints for feature subcode and the number of the fingerprint is equal to scanning step.Specific spy First elementary cell for levying the j-th fingerprint of code is that (J+k*S) of the special characteristic code in scanning direction is individual substantially single Member, wherein S are scanning step, and k is a nonnegative integer, and J=0,1,2 ..., S-1.Described document information can is treated described Find any position in the string segments of scanning.For example, if special characteristic code is " [Rr] [Ee] [Aa] [Dd] [Mm] [Ee] 123.exe ", the scanning step are 4, and the fingerprint length includes 4,8 and 12, can select following four fingerprint:“[Rr] [Ee] [Aa] [Dd] [Mm] [Ee] 123.ex ", " [Ee] [Aa] [Dd] [Mm] [Ee] 123.exe ", " [Aa] [Dd] [Mm] [Ee] 123. ", and " [Dd] [Mm] [Ee] 123.e ", wherein [Rr], [Ee], [Aa], [Dd], and [Mm] is respectively case-insensitive English alphabet r, e, a, d, and m." [Rr] [Ee] [Aa] [Dd] [Mm] [Ee] 123.exe " is put into finger with four fingerprints Line database four times.When scanning step is 1, only needs a fingerprint and need to only be put into once.
The scanning of the multiple fingerprint can be from the preceding S unit of input String field(I.e. in the first scanning step It is interior)Any position start.For example, in one embodiment, the scanning will scan the since the 0th position(k*S)Position Put, wherein k is a nonnegative integer.In any input String field, the(k*S)Position is covered by the 0th fingerprint, the (k*S+1)Position is by(S-1)Individual fingerprint covering, the(k*S+2)Position is by(S-2)Individual fingerprint covering ..., with the(k*S+ S-1)Position is covered by the 1st fingerprint.In another embodiment, scan from(S-1)Position starts, and will scan(k*S+ S-1)Position, wherein k are a nonnegative integer.In any input String field, the(k*S)Position is by(S-1)Individual finger Line covers, the(k*S+1)Position is by(S-2)Individual fingerprint covering, the(k*S+2)Position is by(S-3)The covering of individual fingerprint ..., The(k*S+S-2)Position is covered by the 1st fingerprint, and the(k*S+S-1)Position is covered by the 0th fingerprint.
In order to which the scanning step is brought up to S elementary cell, in one embodiment, using scanning step of being taken in as 1 When select the algorithm of a fingerprint to make following modification for each fixed length condition code or feature subcode, with each for selecting The S fingerprint of fixed length condition code or feature subcode:Step 1 to 4 and in the past it is just the same, step 5 be then revised as from own The dislocation in scanning direction of fixed length condition code or feature subcode find in step 2, described for (J+k*S) subsegment in, The subsegment for finding out minimum cost value is the j-th fingerprint in the S fingerprint, and wherein J=0,1,2 ..., S-1 and k are one non- Negative integer.
Generally each character string object only has a condition code, to support the scanning step of S elementary cells, each fixed length spy The fixed length feature subcode of sign code or random length condition code needs S fingerprint.In another embodiment, it is support S elementary cells Scanning step, S fixed length condition code can be used for identifying a character string object, so that each fixed length condition code is being swept It is (J+k*S) that first elementary cell in direction, which is retouched, in the dislocation of the character string object, wherein J=0,1,2 ..., (S-1) and K is a nonnegative integer.Specific character string object can quilt on any position of any String field to be scanned Identification.In one embodiment, the scanning of S fixed length condition code can not have to fingerprint.In another embodiment, S fixed length Each condition code in condition code can carry out signature scan with a fingerprint.In one embodiment, it is multiple to be based on The random length condition code of multigroup nonoverlapping orderly fixed length condition code can further be selected to identify a character string Object.Every group of fixed length condition code has a S fixed length condition code, and wherein j-th fixed length condition code is basic at first of scanning direction Unit is (J+k*S) in the dislocation of the character string object, wherein J=0,1,2 ..., (S-1) and k be a nonnegative integer, from And form SnIndividual random length condition code, wherein n are the group numbers of nonoverlapping orderly code character for having S fixed length condition code.From every A fixed length condition code is selected in group fixed length condition code, then again by these fixed length condition codes and one or more indefinite long words Symbol string section is combined into SnA random length condition code in individual random length condition code, so that the specific character string object It can be identified on any position of any String field to be scanned.It is each original long in the n groups S fixed length condition codes Condition code is changed into a fixed length feature subcode of the random length condition code after multiple synthesis in post synthesis.In one embodiment, The scanning of the random length condition code can not have to fingerprint.In another embodiment, it is each in the random length condition code Individual fixed length condition code can select a fingerprint to scan.
In one embodiment, to support the scanning step of S elementary cells, specific character string object, which can select P, to be determined Long condition code, each fixed length condition code can further select one or more fingerprints, so that each character string object is total Fingerprint number is equal to S.J-th fingerprint in S fingerprint of the specific character string object is substantially single at first of scanning direction Dislocation of the member in the specific character string object be (J+k*S), wherein J=0,1,2 ..., (S-1) and k it is non-negative whole for one Number, so that the character string object can be identified on any one position of any String field to be scanned.
In another embodiment, multiple random length condition codes based on multigroup nonoverlapping orderly fixed length condition code can To be further selected to identify a character string object.I-th group of not overlapping and orderly fixed length condition code has PiIndividual fixed length Condition code, wherein each fixed length condition code further has one or more fingerprints, so that total finger of i-th group of fixed length condition code Line number is equal to S, wherein i=0,1,2 ..., n-1 and the group number that n is not overlapping and orderly fixed length feature code character.Every group of fixed length is special Levy code S fingerprint in j-th fingerprint scanning direction first elementary cell in the specific character string object Misplace as (J+k*S), wherein J=0,1,2 ..., (S-1) and k be a nonnegative integer.
A fixed length condition code is selected from every group of fixed length condition code, then again by these fixed length condition codes and one or Multiple random length character string combinations are into P0*P1*P2…*Pn-1A random length condition code in individual random length condition code, so that Obtaining the specific character string object can be identified on any position of any string segments to be scanned.The n groups fixed length is special Each original long condition code in sign code is changed into a fixed length feature subcode of the random length condition code after multiple synthesis.
In the condition code of each character string object of a scanning system, after fingerprint, and shadow space determine, a finger The shadow of line can be scanned as an entirety.The fingerprint shadow of different length can be scanned concurrently or sequentially.At one In embodiment, the shadow of the fingerprint of different length can be used as an entirety to carry out serial scan.In one embodiment, one Individual character string condition code is incorporated into described document information database and can be used to lower false code:
Wherein S is scanning step, hiIt is the length of the fingerprint of i-th of character string condition code, IV is original Hash value, and is dissipated It is an order hash function to arrange ().Fingerprint selection () optimal fingerprint for being selected for each shift position, condition code be incorporated into () Condition code is programmed into the database of the fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180, and Fingerprint is incorporated into () and fingerprint is programmed into the fingerprint database 148.When scanning step is more than 1, the finger of each condition code Line will call fingerprint to be incorporated into () once.But looked into because all fingerprints of a condition code all point to same condition code Data structure is looked for, as long as all fingerprints of a condition code call condition code to be incorporated into () once altogether.
In one embodiment, following pseudo- generation can be used by deleting a character string condition code from described document information database Code:
Wherein S is scanning step, and hi is the length of the fingerprint of i-th of character string condition code, and IV is original Hash value, hash () is an order hash function, and fingerprint deletes () and deletes a fingerprint from the fingerprint database 148, and condition code is deleted Except () by the condition code searching data structure of described document information from the fixed length condition code Lookup engine 160 and random length feature Deleted in the database of code Lookup engine 180., be from spy because a condition code has multiple fingerprints when scanning step is more than 1 Levy and a condition code is deleted in code database, fingerprint is deleted () and is called repeatedly.But all fingerprints of special characteristic code As long as condition code is called to delete () once altogether.
Because number of the number of different fingerprint segment length generally more than different fingerprint length is few, in one embodiment, In order to improve scan efficiency, a fingerprint can be further broken down into multiple fingerprint sections.All fingerprint sections of one fingerprint can First concurrently or sequentially to scan, the scanning result of the fingerprint section concurrently or sequentially is synthesized again, to detect fingerprint.It is described The length of fingerprint section can be with identical or different, the length depending on the fingerprint that specific scanning engine is supported.
In one embodiment, the number of fingerprint section and length can be according to its of the length of fingerprint and specific scanning engine Depending on its sweep parameter.In another embodiment, between fingerprint length for linear relationship and all fingerprint sections length all It is identical.Generally, the length of a fingerprint section is equal to one or more scanning steps.In another embodiment, scanning step is The integral multiple of the length of one fingerprint section and multiple fingerprint sections will be synthesized concurrently.
In one embodiment, fingerprint database can include one or more Bloom filters and one or more hash One or more of table.The size and bandwidth of memory should be weighed usually using Bloom filter or hash table.Work as condition code Number can all be stored in the on-chip memory of sufficiently large bandwidth, Bloom filter or may be more even more than hash table It is good.On the contrary, when the yardage of condition code is more, so that when must use chip external memory, the bandwidth of internal memory just turns into main limitation, The hash table rather than the Bloom filter may be more preferable.
In one embodiment, the added bit of hashed value is stored in a Bloom filter or a hash table, To realize that hashed value is multiplexed.In another embodiment, fingerprint length is stored in a Bloom filter or a hash table In, to realize that fingerprint length is multiplexed.When former hashed key is oversize or it is too expensive be compared in scanning fingerprint when, hashed value multiplexing And the multiplexing of fingerprint length can further reduce the probability of false positive matching and fingerprint collision.
In one embodiment, shadow space is worked as, fingerprint, after fingerprint section and finger print data structure are determined to, one fixed The fixed length feature subcode of long condition code or random length condition code can be used as an entirety or segmentation to be programmed into shadow space In the fingerprint database 148 and condition code database.
In one embodiment, the fingerprint of a fixed length feature subcode of a fixed length condition code or random length condition code with Outer other fragments can be encoded and stored fixed length condition code database 166, whole to be used for scanning after fingerprint matching Fixed length condition code or feature subcode.In one embodiment, feature of a fixed length condition code or random length condition code One or more fragments of code can be encoded into multiple tape codes with the code film of one or more elementary cells or sub- elementary cell The fragment of film, to support the specified conditions of character string condition code matching(Such as " not having to matching ", " equal ", " unequal ", " not area Divide capital and small letter ", " case sensitive ", " within the scope of one ", " outside a scope ").In one embodiment, in order to improve Storage efficiency, one or more feature chip segment code films or one or more condition codes or feature subcode code film can individually make With, or be used together with one or more elementary cells or subunit this element number film.
In one embodiment, the condition code section of tape code film can be linked together in order or out of order with chained list, Or pre-process into other lookup structures(Such as tree).In one embodiment, the length of the condition code section of all tape code films can be complete It is equal or not congruent.The length of the condition code section of the tape code film can most preferably select according to the framework of particular memory.
In another embodiment, allowed to allow the false positive matching of character string condition code to converge to zero-sum as quickly as possible The condition code that may be matched converges to zero or one as quickly as possible, and one group of character string condition code can further carry out difference Coding, to build the differential data structure that can quickly search(Such as difference tree).Difference tree can utilize the difference between condition code Unit is built.
In one embodiment, when the one or more new condition codes of addition, or as the one or more existing spies of deletion When levying code, condition code database preprocessing can only be carried out in the initial establishment of condition code database.In another embodiment, The database of condition code can dynamically update in the scanning process of condition code.
The scanning pretreatment of String field
The scanning pre-processing engine 120 is different String field pretreatment according to character string condition code database Form, to simplify and speed up scanning flow line stage thereafter.In one embodiment, the condition code in condition code database Stored with uncoded form.Therefore, pre-processing engine 120 is scanned to decode encoded String field, with the spy The codec format for levying the condition code in code database is consistent.As shown in Figure 1A, it is defeated including a scanning to scan pre-processing engine 120 Send device 122, a String field memory 124, a format decoder 126, a decoding field Memory 128, one Projector 130, and a shadow field Memory 132.The scanning conveyer 122 is first by data to be scanned from the character string Field Memory 124 is loaded into the format decoder 126, and the format decoder 126 will decode to original field again, Parsing and decompression, wherein may include that MIME is decoded, UU decoding, foreign language decoding, removing includes skimble-skamble character string(Such as Extra white space)With counter-scanning garbage character string data(Such as the counter-scanning junk data of injection), HTML parsings, XML solutions Analysis, deflate decompressions, LZS decompressions, PKZip decompressions, and gzip decompression.The format decoder 126 can be with root Normalized character string field is needed according to particular system.Hereafter, the format decoder 126 is by the character after decoding and normalization In string field deposit decoding field Memory 128.
Decoded field is projected to one or more shadow spaces by the projector 130, and shadow data storage Into shadow field Memory 132.For example, a condition code database can include the condition code of case-insensitive.For The condition code of case-insensitive is supported, the projector 130 helps the data conversion decoded in field Memory 128 small Write, and the storage of full lowercase character string field into shadow field Memory 132.Full lowercase character string field can be swept by fingerprint Engine 140 is retouched to be used for carrying out finger scan.Case sensitive condition code and the condition code can of case-insensitive are simultaneously It is scanned.The matching of case sensitive condition code can be after the matching of its shadow, then with case sensitive decoded Field is verified.
In one embodiment, quick character string signature scan engine 100 includes to support to whole String field Carry out the computing resource or the network equipment of character string signature scan.However, in another embodiment, quick character string feature Code scanning engine 100 includes that the computing resource or the network equipment of whole String field can not be stored, due to for example by system Memory space limits and low time delay requirement.Therefore, the String field can be broken down into the character string of multiple predefined sizes Block.The scanning of character string condition code is carried out on the string chunk of each predefined size.
In one embodiment, the size of a string chunk depends on most long character string condition code.The string chunk Three regions can be further separated into:One finger scan region is to cover all signature scan positions, in fingerprint before one The region of scanning area is to provide the reference data before or during fingerprint, and a region after finger scan region is to provide After fingerprint or among reference data.The set in all finger scan regions of specific character string field is covered described All possible fingerprint original position in String field.The trizonal size can be with identical or different.At one In embodiment, the preceding minimum dimension in the region in finger scan region is characterized the maximum fingerprint of all condition codes in yard database Dislocation, the minimum dimension after the region in finger scan region are characterized the length of all condition codes and dislocation in yard database Maximum difference, the minimum dimension in finger scan region is scanning step.
In one embodiment, it is described scanning string chunk size can according to the parameter of the scanning system, for example, The length of most long condition code, internal storage structure, and sweep speed select.When most long condition code is longer, character string is scanned Block can be several times of the length of most long condition code, for example, 2 to 4 times.When most long condition code is shorter, string chunk is scanned It can be the more times of the length of most long condition code.
In one embodiment, scan that the trizonal size of string chunk is identical, the condition code for being equal to most to grow Length.Three regions can be stored in the ring formed with three pieces of internal memories, to reduce movement of the data in internal memory.Another Described trizonal of different sizes in one embodiment, finger scan region is smaller than other two regions, other two regions Size can be with identical or different.
In one embodiment, trizonal each region of a scanning block can be stored in one or more In individual an equal amount of memory block, all memory blocks in all three regions form a past in the region in finger scan region First internal memory BOB(beginning of block), to the ring of last internal memory block end in the region after finger scan region, to reduce number According to the movement in internal memory.After first memory block in finger scan region is scanned through, first memory block of the ring will The ring is exited, can be used for loading new data.After new data are loaded, the memory block will be used as last memory block Add the ring.In one embodiment, one or more add-in memories blocks can add the ring tail of the ring, new for loading Data.
In one embodiment, the String field memory 124 only includes last memory block.Therefore, it is described The size of String field memory 124 is equal to the size of a memory block.The decoding field Memory 128 and the shadow Field Memory 132 includes all memory blocks in all three regions, and its size is all internal memory block size sums.
Due to boundary condition, described first piece has specific condition with last block.In one embodiment, length is most " impossible " elementary cell of the length of long condition code can be filled into the preceding reference zone in the finger scan region Before first piece and after last block of the rear reference zone in the finger scan region." impossible " elementary cell is included not There is the elementary cell that condition code is started or ended up with it, therefore, the region of the filling can not possibly turn into the feature of a reality A part for code.If finger scan engine 140, fixed length condition code Lookup engine 160, and random length condition code Lookup engine There is scanning range checking mechanism in 180, the filling avoids the need for.Scanning range checking mechanism can prevent scanning from exceeding character The border of string field.
Finger scan
In one embodiment, the finger scan engine 140 can include one or more content adressable memorys (CAM)With one or more finite automatas(FA)One or more of.In another embodiment, when original fingerprint or refer to Line is in the shadow of a shadow space for when indicating completely, the finger scan engine 140 may be based on the engine of hash.Such as figure Shown in 1A, in one embodiment, the finger scan engine 140 includes a finger scan controller 142, and a fingerprint dissipates Column counter 144, a fingerprint finger 146, a fingerprint database 148, and a fingerprint synthesizer 150.In a reality Apply in example, a fingerprint is integral to scan, and therefore, the fingerprint synthesizer 150 is optional component.
In another embodiment, each fingerprint is broken down into multistage by independent scan.All fingerprint sections can first quilt Scanning, is then synthesized by serial or parallel again(Such as use fingerprint synthesizer 150)To produce the scanning result of fingerprint.In a reality Apply in example, the finger scan controller 142 controls whole scanning process.
140 exportable one, the finger scan engine without occurrence or with a knot matched in fingerprint database Fruit.The occurrence, which corresponds to one or more, then to be looked into by fixed length condition code Lookup engine 160 and random length condition code The character string condition code for looking for engine 180 to search.If without occurrence, scanning process is just completed.
In one embodiment, the fingerprint hash calculator 144 includes multiple independent universal hash functions, h0, h1..., hI, to support Bloom filter.For example, when internal memory size rather than internal memory Bandwidth-Constrained when, the grand mistake of cloth can be used Filter.For example, when the condition code database of scanning system is sufficiently small, can all be stored in on-chip memory, its internal memory it is big It is small to be restricted.In another embodiment, the fingerprint hash calculator 144 includes multiple independent universal hash functions, h0, h1..., hI, to support multiple hash tables.For example, when false positive matching needs are very low(Such as 10-3It is or smaller), the bandwidth of internal memory When sufficiently wide and size is sufficiently large, multiple hash tables can be used.For example, when the slower chip external memories of DRAM or other are used for The data structure in follow up scan stage is stored, and when on-chip memory stores multiple hash tables enough, it is desirable to false positive matching is non- It is often low.
In another embodiment, the fingerprint hash calculator 144 only includes a hash function h0.For example, when interior When the bandwidth deposited rather than the limited size of internal memory, a hash table can be used.For example, when the feature yardage of scanning system It is sufficiently large according to storehouse, so that when must use chip external memory, the Bandwidth-Constrained system of internal memory.
The fingerprint hash calculator 144 extracts the data of n byte from the shadow field Memory 132, and calculates Its hashed value.The data can be used for calculating working as all hash functions individually or together with random starting values or preceding hashed value Preceding hashed value.In one embodiment, one of hash function is (for example, first hash function h0) hashed value bit Digit is more more than the number of bits of the hashed value of other hash functions, and the number of bits phase of the hashed value of other hash functions Together.
The hashed value of hash function output can be used for searching fingerprint database 148 by fingerprint finger 146.In a reality Apply in example, the fingerprint finger 146 includes a Bloom filter and a hashed value demultiplexer.The Bloom filter Check the effective marker pointed by the hashed value.If all effective markers are all effective, the further root of hashed value demultiplexer A corresponding Hash block is found out according to the additional bit of hashed value.The hashed value demultiplexing is exactly to check particular Hash value Additional bit, hashed value demultiplexing can individually carry out or to checking fingerprint length and other information related with fingerprint while entering OK.Hash demultiplexing can further reduce the result of false positive condition code matching.In one embodiment, the grand filtering of the cloth Device can be condensed to a hash table.In another embodiment, the Bloom filter can expand as multiple hash tables.
The fingerprint of different length can be by parallel scan or serial scan.In one embodiment, multiple fingerprint fingers 146 can be used for carrying out parallel scan simultaneously.For example, a fingerprint finger 146 can be used for scanning each different length Fingerprint.
In another embodiment, a fingerprint finger 146 can carry out serial scan to the fingerprint of different length.Example Such as, the length of fingerprint can be the integral multiple of scanning step.Each quick character string signature scan engine 100 is supported one group and referred to Line length, { S, 2*S, 3*S ..., m*S }, wherein S is scanning step and m*S is the length of most long fingerprint.In each scan position On, the fingerprint of different length is by serial scan.The order hash function of one input S elementary cell can be used for described Scanning.In one embodiment, the serial finger scan of the finger scan engine 140 can be described with following false code:
Wherein S is scanning step, and m*S is most long fingerprint length, and t is the total length of String field to be scanned, IV It is original Hash value, and hash function () is an order hash function.Fingerprint searches () and uses fingerprint finger 146 and fingerprint Synthesizer 150 realizes that condition code searches () and uses fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180 To realize.
Fig. 2A-C demonstrate a fingerprint database 148 when the fingerprint of different length is scanned as an entirety Data structure, including a Bloom filter table 200, a fingerprint hash block chained list 250, and a Hash-entry block 256 pieces.The one group of hashed value given with fingerprint hash calculator 144, { hashed value0A, hashed value1..., hashed valueI, fingerprint Finger 146 goes to read the Bloom filter table 200.Each of the Bloom filter table 200 includes an effective marker 202 With a Hash block chain table pointer 204.Effective marker 202 is the mark set when input has at least one condition code.
In one embodiment, the only hashed value of Hash block chain table pointer 2040AMeaning, and point to fingerprint hash block chained list 250 first term.If the signified effective marker 202 of all hashed values is all set, further finger scan can be in Hash block chain Carried out on the signified fingerprint hash block chained list 250 of list index 204.In another embodiment, Hash block chain table pointer 204 is equal to Hashed value0A, can be omitted, to reduce Bloom filter table or hash table, its cost is to increase the data of sweep phase thereafter Structure.
In one embodiment, hashed value0AIt is used to fingerprint to be added in Bloom filter table 200, and it is other scattered Train value can be used to reduce false positive matching.In addition, in order to not delete character string condition code, each single item of Bloom filter table 200 by mistake The number of its condition code is tracked with a counter.In another embodiment, the Bloom filter table 200 can be condensed to One hash table, only need a hashed value(Such as hashed value0A).In another embodiment, the Bloom filter table 200 can be with Expand as multiple hash tables.
In one embodiment, the fingerprint hash block chained list 250 is a chained list.Fingerprint hash block chained list 250 it is every One includes a last block 252, a next block pointer 254, and a Hash-entry block 256.Next block pointer 254 points to institute State the next item down of fingerprint hash block chained list 250.When being tail item for one of the fingerprint hash block chained list 250, last block 252 will be by Set.Because tail item can also by checking whether next block pointer 254 is null pointer to know, last block 252 be one be used for it is fast Speed detects the optional domain of tail item.Each Hash-entry block 256 includes most n fingerprints, wherein n be it is any be more than zero it is whole Number.In one embodiment, an optimal n can select according to the memory architecture of scanning system.For example, if interior save as SRAM, n can be equal to 1;If inside saving as DRAM, n>1.
In one embodiment, as shown in Figure 2 C, each fingerprint item of the Hash-entry block 256 includes a hashed value (hashed value0B) 260, a fingerprint length 262, a type 264, a feature code group pointer 266 or condition code pointer 268, With a dislocation 270.Hashed value0B260 and fingerprint length 262 be respectively intended to realize hashed value multiplexing and the multiplexing of fingerprint length.When To be that a feature code group exports a feature code group pointer 266 when type 264 is zero;Otherwise, will be that a condition code exports One condition code pointer 268.Dislocation 270 for fingerprint first elementary cell to next subsegment to be compared first base The dislocation of this unit.If without next data structure, dislocation 270 is just not required to.When being not required to dislocation 270, dislocation 270 can be with It is set as zero.In another embodiment, n is worked as>When 1, each fingerprint item can increase an effective marker.There is criterion when described When will is set, hashed value may compare0B260 and fingerprint length 262.
In one embodiment, each fingerprint item of Hash-entry block 256 stores a fingerprint.However, in another reality Apply in example, because fingerprint can be very long, size is again different, and because has the lookup of further feature code after finger print data library lookup Stage, original fingerprint can be not stored in each fingerprint item.Therefore, the fingerprint item of each matching may be due to false positive Matching corresponds to zero fingerprint, or because fingerprint collision corresponds to multiple fingerprints.When with a simple hash table rather than one Individual Bloom filter table scans fingerprint and (k/2m)<<When 1, the order of magnitude of the probability of false positive matching is (k/2m) and fingerprint touch The order of magnitude of the probability hit is (k2/22m), it in total fingerprint number and m is to include hashed value that wherein k, which is,0AAnd hashed value0BHashed value0 Total bit.
When with a Bloom filter table, two above probability will decline to a great extent.When with multiple hash tables, described two Individual probability will be further compromised.In order to as far as possible reduce after phase characteristic code search, false positive matching and fingerprint collision probability can To reduce to close to zero.In one embodiment, the number of sufficiently large m and hash function can be used for reducing false positive matching With the probability of fingerprint collision.
In order to reduce memory space, in one embodiment, multiple hashed values can be multiplexed into a fingerprint item, to reduce The probability of null term.The hashed value of each m positions can be divided into two parts:The hashed value of m1 positions1The hashed value of (m-m1) position2.Dissipate Train value1It is used as the address of fingerprint database, and hashed value2(Such as hashed value0B260)It can be used to solve Hash collision and vacation sun Property matching.M1 is smaller, and required memory space is fewer, but fingerprint hash block chained list 250 is longer.In one embodiment, when 2m1When being more than twice or more times of k, the average length of fingerprint hash block list is less than 1.
In order to save the complexity of the management of memory space and the reduction table, in another embodiment, all differences The fingerprint of length can be multiplexed into a fingerprint database.Fingerprint length 262 can further reduce false positive matching and fingerprint The probability of collision.
Searching data structure as shown in figures 2 a-c can be implemented with several different modes.Particular implementation can be according to spy Levy the size of code table, the size and type of free memory, such as SRAM on piece, depending on DRAM on piece, the outer SRAM of piece and the outer DRAM of piece. For example, in one embodiment, if feature number of codes is 128K, the effective marker 202 can be stored on a piece SRAM is to realize quick access, and Hash block chain table pointer 204 is then stored in the outer SRAM of a piece.Effective marker 202 can be used All hashed values access, and Hash block chain table pointer 204 can only use hashed value0AAccess.Only when the effective marker of all hashed values 202 all for it is effective when, need to just access Hash block chain table pointer 204.Last block 252 and next block pointer 254 can be stored in one In the outer SRAM of piece, hashed value0B260 and fingerprint length 262 can be stored in the outer SRAM or DRAM of a piece, type 264, feature Code group pointer 266 or condition code pointer 268, and dislocation 270 can be stored in the outer DRAM of a piece.Only work as hashed value0B260 Hes When fingerprint length 262 matches, access type 264, feature code group pointer 266 or condition code pointer 268, and dislocation 270 are just needed.
In one embodiment, in order to improve scan efficiency, the fingerprint of a character string condition code can resolve into multiple Fingerprint section.First these fingerprint sections can concurrently or sequentially be scanned, then use fingerprint synthesizer 150 again by the fingerprint of matching Section synthesizes the scanning result of fingerprint.Because the number of different fingerprint segment length is generally much smaller than the number of different fingerprint length, Fingerprint is divided into fingerprint section can accelerate finger scan to scan.Fingerprint is divided into fingerprint section to scan, makes support is longer to contain The fingerprint of more fingerprint sections is possibly realized, so as to reduce false positive matching.
In one embodiment, the synthesis of fingerprint section is accurate or complete non-false positive matching.In another implementation In example, in order to accelerate finger scan, the synthesis of fingerprint section is " coarse " or partial, there is false positive matching.In order to reduce vacation The information of the probability of positive match, one or more possible positions of each fingerprint section and the possibility length of fingerprint can be stored up Deposit and be used to fingerprint section concurrently or sequentially synthesizing any fingerprint matching.
When multiple fingerprint section serial scans, Hash-entry block fingerprint corresponding with its of " at least one fingerprint matching " closes The different implementations grown up to be a useful person are for example shown in Fig. 2 D-E.In one embodiment, as long as at least one fingerprint matching is just reported One matching, but the information of the length of the fingerprint about how many fingerprint matching and matching can not be provided.
As shown in Figure 2 D, in one embodiment, each of Hash-entry block 257 includes a hash of particular fingerprint section Value0B260, a fingerprint section Figure 27 2, an at least fingerprint composite signal 274, a type 264, a feature code group refer to Pin 266 or condition code pointer 268, and a dislocation 270.Fingerprint section Figure 27 2 is an effective marker bitmap array, its bit The number for the fingerprint section that digit is supported with the fingerprint synthesizer 151 is identical.When i-th of fingerprint of Duan Weiyi fingerprint of fingerprint Section, fingerprint section Figure 27 2 i-th bit will be set.It is all in all fingerprints that fingerprint section Figure 27 2 provides the fingerprint section Possible position.At least a fingerprint composite signal 274 provides the number of minimum steps of the fingerprint containing the fingerprint section, at least one The fingerprint synthesis of fingerprint section matching.In one embodiment, an at least fingerprint composite signal 274 is stored in the of the fingerprint In one fingerprint section.In another embodiment, an at least fingerprint composite signal 274 or any other fingerprint length information are deposited Storage is in each fingerprint section of the fingerprint.In another embodiment, at least a fingerprint composite signal 274 is omitted.
In one embodiment, hashed value0BHashed value in 260 and Fig. 2 C0B260 is identical.Type 264, feature code group refer to Pin 266 or condition code pointer 268, it is also identical with the respective value in Fig. 2 C with dislocation 270, it is stored in first fingerprint of fingerprint Duan Zhong, and used after " fingerprint hop count subtracts 1 " individual clock cycle is postponed.In one embodiment, due to type 264, condition code Group's pointer 266 or condition code pointer 268, and dislocation 270 are only stored in first fingerprint section of fingerprint, all first fingerprints Section identical fingerprint is stored in together.In one embodiment, in the selection course of fingerprint, to allow has identical first The probability of multiple fingerprints of fingerprint section is minimum.
Fig. 2 E demonstrate one when at least a fingerprint composite signal 274 is only stored in first fingerprint section of each fingerprint One embodiment of fingerprint synthesizer 151.In one embodiment, the size of fingerprint section is elected as identical with scanning step.For example, The size and scanning step of fingerprint section are all 4, so that the length of fingerprint is 4,8,12, and 16.The fingerprint synthesizer 151 Including 12 d type flip flops, 280,32 inputs and door 282, and the MUX 284 of 14 input.Fingerprint synthesizer 151 Input be the fingerprint composite signals 274 of fingerprint section Figure 27 2 and at least one.If finding a fingerprint, fingerprint synthesizer 151 will One matching 290 of output.The matching 290 is only when first fingerprint section of a synthesis fingerprint is first finger of a fingerprint It is effective during line section.
In one embodiment, can be 5 defeated by one in order to examine the MUXs 284 of the input of matching 290,4 The MUX substitution entered, wherein the input newly added is connected to zero and is chosen when fingerprint section is not first fingerprint section. In one embodiment, by increasing gate in the different time delay stages, fingerprint synthesizer 151 can be expanded, to support one The fingerprint length information of the fingerprint section in addition to the first fingerprint section of individual fingerprint.In another embodiment, when an at least fingerprint When composite signal 274 is not stored in any fingerprint section, fingerprint synthesizer 151 can be by deleting MUX 284 Simplify with the gates of all most short fingerprints being not used in condition code database.In another embodiment, fingerprint Synthesizer 151 can be easily revised as with other scanning steps, the number of fingerprint section, and fingerprint length.In an implementation In example, because multiple fingerprint matchings are merged into the matching of one " at least one fingerprint matching ", fingerprint synthesizer 151 can be Each clock cycle exports the result of a finger scan.
When multiple fingerprint section serial scans, the Hash-entry block for providing " all fingerprint matchings " of all matchings is right with its Another implementation for the fingerprint synthesizer answered is for example shown in Fig. 2 F-G.When the fingerprint of multiple different lengths is as a condition code During with being matched, it may be necessary to which the stage is scanned after one or more.
As shown in Figure 2 F, each single item of the Hash-entry block 258 includes a hashed value of particular fingerprint section0B260, one Individual fingerprint section Figure 27 2, all fingerprint composite signal 276, a type 264, a feature code group pointer 266 or feature Code pointer 268, and a dislocation 270.All fingerprint composite signals 276 are the fingerprint composite signals of all matchings, its digit and The number for the fingerprint section that fingerprint synthesizer 152 is supported is identical.When a fingerprint section is have i fingerprint section one section of fingerprint When, the i-th bit of all fingerprint composite signals 276 will be set.Corresponding domain phase in the other domains and Fig. 2 D of Hash-entry block 258 Together.In one embodiment, all fingerprint composite signals 276 are merely stored in first fingerprint section of fingerprint.Another In individual embodiment, all fingerprint composite signals 276 or any other fingerprint length information are stored in each of fingerprint In individual fingerprint section.In another embodiment, all fingerprint composite signals 276 are omitted.
Fig. 2 G demonstrate one when all fingerprint composite signals 276 are merely stored in first fingerprint section of each fingerprint There is one embodiment of the fingerprint synthesizer 152 of multiple fingerprint length.The fingerprint synthesizer 152 is scanned with Fig. 2 E identicals Step-length, fingerprint section size, and fingerprint length.The fingerprint synthesizer 152 includes 16 d type flip flops, 280,52 inputs and door 282, and 13 input with door 286.The fingerprint synthesizer 152 exports a matching0292, a matching1294, one Match somebody with somebody2296, and a matching3298, respectively with one, two, three, and the matching of the fingerprint of four fingerprint segment lengths.
In one embodiment, by increase in different time delay stage gate (as with door) or increase is existing and door Input, the fingerprint synthesizer 152 can be expanded, to support the attached of the fingerprint section in addition to the first fingerprint section of a fingerprint Adding fingerprint length information.In another embodiment, when all fingerprint composite signals 276 are not stored in any fingerprint section When, it can be deleted for input of all fingerprint composite signals 276 with door and all gates relevant with fingerprint length, With the simplification fingerprint synthesizer 152.In one embodiment, the Hash-entry block 258 in Fig. 2 F can be expanded, including more Set type 264, feature code group pointer 266 or condition code pointer 268, and dislocation 270;Every group of domain is used for each fingerprint length.This Outside, in one embodiment, the information of the exact length about fingerprint can also be used for the follow up scan stage.
Fingerprint synthesizer 152 can scan the fingerprint of all different lengths simultaneously.But due to type 264, feature code group Pointer 266 or condition code pointer 268, and dislocation 270 are merely stored in a fingerprint section, multiple fingerprints for sharing the fingerprint section It is stored together.The appropriate selection of the fingerprint of condition code can be preferably minimized the influence.However, in order to eliminate the shadow Ring, the type 264 of the fingerprint of the fingerprint section of all matchings, feature code group pointer 266 or condition code pointer 268, and dislocation 270 can To be stored in the table of another fingerprint section meaning matched.All addresses to the fingerprint section matched can be used for Search an entry in the table.
In another embodiment, in addition to an at least fingerprint composite signal 274 or all fingerprint composite signals 276, refer to Line section Figure 27 2 can be further omitted, and be stored in from without any fingerprint composite signal in any fingerprint section.Fingerprint Section can synthesizes according to the possible form of all fingerprints in described document information database.If multiple fingerprint sections meet all Any form of the form of fingerprint, it is considered as a fingerprint matching.For a kind of special circumstances, if multiple fingerprint sections expire The minimum requirements of all fingerprint formats of foot, is considered as a fingerprint matching.
When multiple fingerprint section parallel scans, the Hash-entry of " an at least fingerprint fingerprint matching " and " all fingerprint matchings " Block implementation different with one of its corresponding fingerprint synthesizer is for example shown in Fig. 2 H-I.As illustrated in figure 2h, the Hash-entry block Each of 259 includes a hashed value0B260, a fingerprint section matching 278, a type 264, a feature code group pointer 266 Or condition code pointer 268, and a dislocation 270.The fingerprint section matching 278 is the match bit of fingerprint section, other domains and figure Corresponding domain is identical in 2D.The fingerprint section matches some spy of 278 pairs of parallel fingerprint synthesizers as shown in figure 2i Fixed section adaptation is (such as section adaptation0, section adaptation1, section adaptation2, or section adaptation3) there is specific implication.When fingerprint section For i-th of fingerprint section of any fingerprint, the fingerprint section matching 278 of i-th section of adaptation will be set.In one embodiment, I-th section of adaptation only stores i-th of fingerprint section of fingerprint, so as to which fingerprint section matching 278 is always set, it is convenient to omit.One In individual embodiment, in order to further reduce false positive synthesis, a similar at least fingerprint composite signal 274 or all fingerprints synthesis letter The fingerprint length information of breath 276 can be stored in first fingerprint section of a fingerprint or all fingerprint sections.
Fig. 2 I demonstrate the one embodiment without the parallel fingerprint synthesizer 153 of fingerprint length information.The fingerprint Synthesizer 153 include one 2 input with door 282, one 3 input with door 286, one 4 input with door 288, and one 4 The OR gate 285 of input.The fingerprint synthesizer 153 is that all fingerprint matchings export a matching0292, a matching1294, one Individual matching2296, and a matching3298, and be that an at least fingerprint matching exports a matching 290.In one embodiment, one Individual overall fingerprint length filters can use matching0292, matching1294, matching2296, matching3298, and matching 290, with mistake Filter impossible fingerprint length.In one embodiment, by increasing gate, the fingerprint synthesizer 153 can be expanded Exhibition, to support the fingerprint length information stored in the first fingerprint section or all fingerprint sections of each fingerprint.In an implementation In example, parallel fingerprint section scanning can use a long scan step-length for being equal to fingerprint segment length several times, to accelerate sweep speed.
In general, in one embodiment, each fixed length condition code can be broken down into multiple fragments to carry out independence Scan on ground.As scan with synthesis fingerprint section as, all fragments of a condition code can be first scanned, then again by serially or It is parallel to close.Generally, the number of different characteristic code segment length, it is also more much smaller than the number of different characteristic code length even if not being one, So as to accelerate to need the speed of scan method individually scanned for each modal length.In one embodiment, Two or more length identical condition code sections are by parallel scan.
In another embodiment, one or more of one or more hash tables and one or more Bloom filters It is used to scan through all referring to bright condition code section.In one embodiment, when at least one condition code matches, the fingerprint Synthesizer 150 can be used to that identified condition code section is serially or parallelly synthesized any condition code matching.Shown in Fig. 2 D-I Data structure and embodiment can be used for composite character code section.In another embodiment, one or more finite automatas (FA)With one or more content adressable memorys(CAM)One or more of be used to the condition code section that has matched to synthesize Matched for any condition code.
Fixed length signature scan
As shown in figure 1, the fixed length condition code Lookup engine 160 includes a condition code search engine 162, a condition code Examine device 164, and a fixed length condition code database 166.Condition code search engine 162 can identify a possible fixed length feature Code, the fixed length feature subcode of a possible random length condition code, or one include multiple possible fixed length condition codes or fixed length The condition code family of feature subcode.Possible the fixed length condition code or feature subcode identified by condition code search engine 162 will be by Condition code is examined device 164 and examined.Fixed length condition code database 166 is then characterized yard search engine 162 and condition code examines device 164 Database.
Fixed length condition code database 166 can be implemented with plurality of data structures.In one embodiment, such as Fig. 3 A-B institutes Show, fixed length condition code database 166 is one and multiple condition code chained lists 350 are chained up with constitutive characteristic code group chained list 300 Two-dimensional chain table.The each single item of feature code group chained list 300 includes a next yard of pointer 302, a dislocation 304, and a spy Levy code pointer 306.Next yard of pointer 302 is the pointer of the next item down for pointing to feature code group chained list 300, condition code pointer 306 be a pointer for pointing to special characteristic code chained list 350.Dislocation 304 is to refer to from the first module of particular fingerprint to condition code The dislocation of the first module of the signified specific character string condition code of pin 306.
Each fixed length condition code or feature subcode can be broken down into multiple condition code sections 352, to form condition code chained list 350.Described document information section 352 can be suitable by scanning since the first elementary cell of a fixed length condition code or feature subcode Sequence connects.In one embodiment, the size of described document information section is different.In another embodiment, all spies The size of sign code section is identical, and its size can most preferably select according to system architecture.The each single item of condition code chained list 350 includes one Individual condition code section 352, a condition code section film 354, a latter end 356, a next segment pointer 358, a type 360, and One numbering 362.Next segment pointer 358 is a pointer for pointing to next section, and latter end 356 is latter end mark.When type 360 is When 0, numbering 362 is fixed length feature subcode numbering 364;Otherwise, numbering 362 is characterized code numbering 366.Condition code section film 354 is used In the specific matching condition for specifying each elementary cell or even sub- elementary cell, including " not having to matching ", " equal ", " not phase Deng ", " within the scope of one ", " outside a scope ", " case-insensitive ", and " case sensitive ".The matching condition It can be realized by the input source and output source of selecting unit comparator.If a fixed length condition code or feature subcode are not The integral multiple of the size of condition code section 352, it can be filled out in the afterbody of the fixed length condition code or feature subcode with for up to " feature The size of code section 352 subtracts one " individual " 0 " or other values, and the film of corresponding fills unit is arranged to " not having to matching ".
In one embodiment, the condition code section film 354 of each elementary cell is 3 bits." case-insensitive " When, the first bit is arranged to 0;When " case sensitive ", the first bit is arranged to 1.Latter two bit is set when " equal " 0 is set to, latter two bit is arranged to 1 when " unequal ", and latter two bit is arranged to 2 when " not having to matching ", latter two ratio Special position is 3 retained.More film bits can be used for realizing other matching conditions as needed, for example, pre-defined Scope(Numerical character or alphabetic character), symbol class, or any range.In another embodiment, in order to improve storage efficiency, The code film of one or more condition code sections or fixed length condition code or feature subcode can be used alone, or with elementary cell or subunit Used in the code film of this unit coordinates.
In one embodiment, described document information search engine 162 can search feature code group chained list 300 until the chained list Tail item, i.e. next yard of pointer 302 is null pointer.The each single item of feature code group chained list 300 all returns to one feature of a sensing The condition code pointer 306 of code chained list 350.
Described document information, which examines device 164, can be verified as each progress of condition code chained list 350 condition code checking.Condition code core Real device 164 is since the first paragraph of condition code chained list 350, by each condition code section of scanning sequency paragraph by paragraph examination.If mismatched, Condition code, which examines device 164, will stop hypomere inspection;Otherwise, condition code examine device 164 will check whole condition code chained list 350 until Its tail, i.e. latter end 356 are 1.If finding a matching, when type is 0, the fixed length subcode for random length condition code is matched, Condition code, which examines device 164, will export a fixed length feature subcode numbering 364;When type is 1, match as fixed length condition code, spy Sign code, which examines device 164, will export a fixed length condition code numbering 366.
In another embodiment, in order that false positive drops to zero with matching the rapid development with hop count, condition code section 352 It can be linked together by an optimal sequence.Condition code chain will be added to by representing the dislocation of current signature code section to next condition code section In each single item of table 350.Although the length of condition code section can differ, the length of a fixed condition code section can be chosen Select.
Fig. 4 A-B demonstrate an implementation of a condition code unit comparator 400 and a condition code section comparator 450 The block diagram of example.Described document information unit comparator 400 carries out unit comparison in the lookup of fixed length condition code.Condition code unit ratio Include a code film decoder 402 compared with device 400, one 2 input MUX 404, an equality comparator 406, two non- Door 408, one 4 input MUX 410, one 2 input OR gate 412, and a range comparator 414.Code film decoder 402 can be decoded as code film bit the control signal of input and the output of equality comparator 406 and range comparator 414.One In individual embodiment, range comparator 414 is optionally used to support the global scope or each of predefined each String field The global scope of condition code.In one embodiment, m range comparator 414 can be used for supporting m predefined global models Enclose.In another embodiment, units match 416 can be with " not having to match " bit logic or being output afterwards.
Multiple condition code unit comparators 400 and a multi input can be used to construction feature code section comparator with door 452 450.The data cell of condition code section comparator 450 is typically a byte, but is alternatively four bits or any other size. In another embodiment, condition code unit comparator 400 can be by a condition code unit comparator 480 as shown in Figure 4 C Substitution, with supported feature code unit local scope.The unit of each tape code film in described document information section 352 can expand to The unit scope of one tape code film, or the upper bound of given condition code unit of tape code film and the unit pair of lower bound.
In one embodiment, the fixed length condition code Lookup engine 160 shown in Figure 1A may require to look up multiple condition codes Chained list.However, it is necessary to the probability for searching multiple condition code chained lists is generally very low.When requiring to look up multiple condition code chained lists, The differential coding that multiple condition codes in a group condition code are mutually encoded can be used.
For example, in one embodiment, selection string elements tree 500 as shown in Figure 5A can be as shown in Figure 1A The searching data structure of condition code search engine 162.Selection string elements tree 500 includes burl 520a-520e.Select character string Each burl 520 of cell tree 500 has two branches, and a matching branch is by pointer1530 is signified, and another mismatches branch By pointer2532 is signified.As shown in Figure 5 B, string elements tree 500 is selected there are two kinds of different burls 520:Type 528 is 1 Leaf and the non-leaf that type 528 is 0.The non-leaf of matching always points towards another burl 520, and unmatched non-leaf points to tree On another burl 520 or empty section.The leaf of matching always points towards a condition code family chained list 550 as shown in Figure 5 C, no The leaf of matching points to another burl 520 or empty section on tree.
In one embodiment, as shown in Figure 5 B, each burl 520 of string elements tree 500 is selected to include a mistake Position 522, a unit 524, a unit membrane 526, a type 528, a pointer1530, and a pointer2532.Type 528 include leaf as described above and non-leaf.Selected unit 524 can be in any position of condition code, when burl 520 is not During tree root, its position provided by the dislocation 522 of previous burl 520 (for example, node 520b front nodal point is node 520a, but It is that 520a does not have front nodal point because node 520a is the tree root for selecting string elements tree 500), when burl 520 is tree root (for example, node 520a)When, its position is provided by the dislocation 270 of the occurrence in fingerprint hash block chained list 250 (see Fig. 2 B).Often Individual burl includes a unit membrane 526 corresponding (see Fig. 3 B) with the condition code section film 354 in condition code chained list 350.
At least one elementary cell of any two character string condition code is different, as long as one of character string condition code is not It is the subsegment of another character string condition code.Therefore, a unit 524 can be chosen to distinguish at least two character string features Code, so that after the unit 524, at least one character string condition code can be eliminated.By to selection The lookup of string elements tree 500, the different condition code of at least one elementary cell can be distinguished.In one embodiment In, shown selection string elements tree 500 is binary tree.In another embodiment, selection character corresponding to can building K points of trees of string location.K elementary cell of the k condition code on same elementary cell position in one feature code group can be made Divide a burls for tree for k, although each condition code can also contribute multiple elementary cells into each burls of k points of trees.
In one embodiment, a character string condition code can be a part for another character string condition code.These The condition code for having mother-child relationship (MCR) can not be chosen the differentiation of string elements tree 500.In one embodiment, as long as finding therein Any one condition code, just it is not required to further search for, it is not necessary that distinguishing has the condition code of mother-child relationship (MCR).Therefore, only need to scan Wherein most short condition code.But in another embodiment, it is necessary to distinguish all condition codes or need to identify most long spy Levy code.
In order to support the condition code family of mother-child relationship (MCR), in one embodiment, a condition code as shown in Figure 5 C Family's chained list 550 can be characterized the searching data structure that code examines device 164(See Figure 1A).Condition code family chained list 550 it is every One includes a type 552, a dislocation 554, a condition code section 556, a condition code section film 558, a next item down Pointer 560, a numbering type 562, and a numbering 564.In one embodiment, for supported feature code family, feature Code family chained list 550 has two kind items:The result items that type 552 is 0 lookup item and type 552 is 1.In inspection, each is looked into When looking for, condition code section 556 will be compared according to condition code section film 558.Condition code section 556 and condition code section film 558 and feature Corresponding domain in code chained list 350 is identical (see Fig. 3 B).But result items are not required to any condition code section and compared.
In one embodiment, the condition code of all matchings of the system searching, and export the condition code of each matching Numbering 564.However, the lookup of condition code will proceed to the afterbody or the next item down pointer 560 of condition code family chained list 550 always For null pointer, to find all female condition codes.Numbering 564 has two kinds:Feature subcode numbering 566 and condition code numbering 568.Compile Depending on numbers 564 type is by numbering type 562.When numbering type 562 is 0, numbering 564 is feature subcode numbering 566, is represented The fixed length feature subcode of one random length condition code;Otherwise, numbering 564 is condition code numbering 568, represents a fixed length feature Code.
In one embodiment, condition code family chained list 550 since a minimus generation or most short subcode, always It is connected to an oldest generation or most long subcode.Dislocation 554 is specified from the first module of existing condition code section 556 to lower condition code section The dislocation of 556 first module.If existing condition code section mismatches, the lookup of condition code family chained list 550 can stop.It is fixed to provide Any dislocation 554 between feature subcode makes to terminate into possibility in advance based on unmatched.
In one embodiment, every generation of condition code family chained list 550 can only support a condition code.If one is specific , there are multiple condition codes in generation, can use multiple condition code families chained list 550, and each condition code in the specific generation needs a spy Zheng Ma families chained list 550.Multiple condition code families chained list 550 can be distinguished with selection string elements tree 500.
In another embodiment, the condition code of tape code film can be content addressed including one with one or more memories Memory(CAM)To store and scan.
Random length signature scan
As shown in figure 1, the fixed length condition code Lookup engine 160 is exported the random length feature of all matchings by scanning sequency The numbering of the fixed length feature subcode of code, size, the position in the String field.With identified fixed length feature subcode Identified fixed length feature subcode is synthesized any random length condition code by information, random length condition code Lookup engine 180.One In individual embodiment, one or more finite automatas(FA)It is used to synthesize fixed length feature subcode.In another embodiment, no Fixed length condition code Lookup engine 180 includes a condition code rule searching device 182, a condition code state verification device 184, one Condition code composition rule database 186, and a condition code state form 188.Condition code rule database 186, which provides, how will The fixed length feature subcode of random length condition code synthesizes the static rule of random length condition code.The dynamic of condition code state form 188 Ground is the state of a process of an input String field storage synthesis.
Condition code rule searching device 182 finds out fixed length feature with having matched from condition code rule database 186 The related rule of code numbering, and the related condition code rule is supplied to condition code state verification device 184.Condition code state The fixed length feature subcode matched is synthesized random length condition code by validator 184 according to condition code composition rule, and updates spy Levy code state form 188.
Condition code rule database 186 and condition code state form 188 can have a variety of data structures.In a reality Apply in example, condition code rule database 186 can be condition code regulation linked 600 as shown in Figure 6.Condition code regulation linked Pointed by the numbering of the 600 feature subcodes found by fixed length condition code Lookup engine 160.Multiple random length condition codes can include phase Same fixed length feature subcode.Condition code regulation linked 600 can be special by all fixed length representative comprising special characteristic subcode numbering The random length condition code of sign subcode links together.
In one embodiment, the corresponding random length condition code of each single item of condition code regulation linked 600.Each single item bag Containing a condition code numbering 602, an order 604, a last code 606, a next yard of pointer 608, and a distance range Information 610.Condition code numbering 602 represents specific random length condition code.Sequentially 604, which be that feature subcode numbering is representative, determines Order of the long feature subcode in all fixed length feature subcodes of the random length condition code.Last code 606 indicates that the fixed length is special Sign subcode whether be random length condition code last fixed length feature subcode.Last code 606 indicates random length signature scan Terminate.Next yard of pointer 608 points to 600 the next item down.Distance range information 610 is to indicate the fixed length feature subcode with Distance range (the i.e. minimum and most unit number between two fixed length feature subcodes of fixed length feature subcode)Optional domain.For example, When the scope or longest distance or beeline it is previously given or for it is infinitely great when, distance range information 610 can be saved Omit or be reduced to beeline or longest distance.
In another embodiment, each single item of condition code regulation linked 600 can include one or more additional fields, with The one or more random length feature subcodes between two or more fixed length feature subcodes of description.For example, " a mould can be used One " pattern " of one repetition for being used for filling up distance range information 610 is added to feature by formula " or one " pointer of pattern " In each single item of code regulation linked 600, to describe random length feature subcode.
As shown in fig. 7, in one embodiment, condition code state form 188 can use one or more condition code states Chained list 700 is implemented.Each condition code state-chain-table 700 can be a String field of particular link, and dynamic memory is The condition code state of all fixed length feature subcodes of all random length condition codes of identification.Condition code state-chain-table 700 it is each Item includes a condition code numbering 702, a previous subcode order(lorder)704, a next subcode position 706 , and a next yard of pointer 708 (nptr) (nloc).Condition code numbering 702 is the numbering of specific random length condition code.Previous son Code order 704 be the specific random length condition code previous fixed length feature subcode numbering to specific fixed length feature son The order of code.Next subcode position 706 be next fixed length feature subcode of the specific random length condition code numbering to A fixed length feature subcode active position scope.
In one embodiment, each String field of each line has a condition code state-chain-table 700.Generally, At each moment, each line only has a String field to be scanned, only a condition code state-chain-table 700.Condition code State-chain-table 700 may include all subcodes matched of all random length condition codes of a String field of particular link Whole effective history.
Condition code state-chain-table 700 can be dynamic.In one embodiment, if the volume of the fixed length feature subcode Number signified fixed length feature subcode is first fixed length feature subcode of a random length condition code, i.e., its order 604 is 1, and Also the spy of a String field of particular link can be added into without the item of the specific random length condition code, a new item Levy in code state-chain-table 700.If the position of first unit of the fixed length feature subcode now matched is not in next subcode bits Put 706 to effective scope in or time-out occur, one of described document information state-chain-table 700 can be deleted.It is described One of condition code state-chain-table 700 can also be deleted after the random length condition code of based on the item matching is found Remove.In one embodiment, all items of the condition code state-chain-table 700 of a String field of particular link can be one The end of individual String field is deleted.
As shown in figures 1 to 6, in one embodiment, described document information rule searching device 182 is from the fixed length condition code Lookup engine 160 receives the numbering of a fixed length feature subcode.Described document information rule searching device 182 searches the whole fixed length The condition code regulation linked 600 of the numbering meaning of feature subcode, and each single item of described document information regulation linked 600 is given Information (such as { condition code numbering 602, order 604, last code 606, distance range information 610 }) and the fixed length condition code are searched and drawn Hold up 160 to information (as the position of the first module of fixed length feature subcode, the length of fixed length feature subcode, line numbering, String field is numbered }) it is sent to successively together in condition code state verification device 184, reach up to described document information rule chain The afterbody of table 600 (i.e. next yard of pointer 608 is null pointer).
The information for each single item that described document information state verification device 184 is given with described document information rule searching device 182 is looked into Look for the signified condition code state-chain-table 700 of the line numbering.To each single item of condition code state-chain-table 700, if condition code Numbering 602 and condition code numbering 702 are identical, and order 604 is equal to previous subcode order 704 plus one, and the of fixed length feature subcode The position of one elementary cell is just a matching in the range of the effective marker that next subcode position 706 is given is put.To feature Each occurrence of code state-chain-table 700, if last code 606 is 1, output described document information numbering 602, and by the Xiang Congte Deleted in sign code state-chain-table 700;Otherwise, the item is updated, and allows the value 704 of previous subcode order to be equal to the value of order 604, The value of next subcode position 706 be equal to fixed length feature subcode first unit position, the length of fixed length feature subcode, and away from From the summation of the value of range information 610.It is not required to do anything to unmatched item.
In one embodiment, when condition code state-chain-table 700 is short and particular link only needs scanning one in particular moment String field, it can be scanned with described document information state-chain-table 700.But if condition code state-chain-table 700 grows or spy Multiple String fields need to be scanned in particular moment by determining line, and other searching data structures can be used for described document information state form 188.In one embodiment, described document information state form 188 can be analogous to one of the data structure in Fig. 2A-C Condition code state Bloom filter or a condition code state hash table.Condition code state Bloom filter or condition code state dissipate The hashed key of list can be 3 tuples { line is numbered, String field numbering, condition code numbering }.In one embodiment, when Each line only has a String field in particular moment and is scanned, and is not required to be numbered with String field.
As shown in figure 8, in one embodiment, the Hash-entry block 256 in Fig. 2A-C can be hashed item 856 and take Generation.The each single item of Hash-entry block 856 includes special sign code numbering 860, a line numbering 862, a String field Numbering 864, a previous subcode order 866, and a next subcode position 868.Previous subcode order 866 and next subcode bits Put 868 definition and previous subcode 704, the hashed keys 3 yuan identical with the definition of next subcode position 706 of order in described Fig. 7 Group { condition code numbering 860, line numbering 862, String field numbering 864 } can be stored and be made to solve Hash collision. To each single item of Hash-entry block 856, if former hashed key is identical, order 604 is equal to previous subcode order 866 plus one, and fixed The position of the first module of long feature subcode is just one within the scope of the active position that next subcode position 868 is given Match somebody with somebody.To each occurrence of Hash-entry block 856, if last code 606 is 1, output described document information numbering 602, and by described in Item is deleted from Hash-entry block 856;If last code 606 is 0, the item of the Hash-entry block 856 is updated, and allows previous subcode Sequentially 866 are equal to order 604, and next subcode position 868 is equal to the position of the first module of fixed length feature subcode, fixed length feature The length of code, and the summation of distance range information 610.Mismatch item and then keep constant.
In one embodiment, when the order and distance range of two continuous fixed length feature subcodes meet condition code rule It is one or more in chained list 600, condition code state verification device 184 will further examine whether described two fixed length feature subcodes Between string segments and condition code regulation linked 600 one or more signified one or more random length feature subcodes Match somebody with somebody.When the character string in the specific character string field between described two continuous fixed length feature subcodes and the feature When one or more signified random length feature subcodes of code regulation linked 600 match, described document information state-chain-table 700 It will be updated with the new fixed length feature subcode.
The design and performance of scanning system
In one embodiment, speed of the speed of quick character string signature scan engine 100 by finger scan engine 140 Degree limitation, for example, when the false positive matches the sufficiently small and follow up scan stage and properly designed.When fingerprint is whole as one When body is scanned by length one by one, the speed of finger scan engine 140 depends on scanning step, the number of fingerprint length, and clock Velocity composition.In one embodiment, the speed of the scanning engine 100 is (s/m) * R, and wherein s is scanning step, and m is finger The number of line length, and R are clock speed.For example, if scanning step is 8 bytes, the byte of fingerprint length 4,8,16 and 32, when Clock frequency is 500MHz, and the speed of the scanning engine 100 is 8* (8/4) * 500MB/s=8Gbits/s.
In another embodiment, if fingerprint is first segmented carry out parallel scan, then " an at least fingerprint matching " is used again Serial synthesis, and fingerprint segment length is identical with scanning step, the speed of a single scanning engine 100 is s*R.In a reality Apply in example, when s and R is taken with precedent identical value, the scanning engine 100 can be with one character string of 32Gbps velocity scannings Field.In addition, in another embodiment, fingerprint can first be segmented and be scanned, then parallel synthesis again, scanning step is so as to sweeping Can further be improved by retouching speed by n times, and wherein n is the number of the fingerprint section of parallel scan and synthesis.When fingerprint segment length and R take During with precedent identical value, if n is 4, i.e. for 4 fingerprint sections by parallel scan and synthesis, scanning step is 32 bytes, described to sweep Retouching engine 100 can be with one String field of 128Gbps velocity scannings.
Sweep speed discussed above is the speed of a single signature scan engine.In one embodiment, when When carrying out parallel scan with several signature scan machines, signature scan speed can further improve several times.
In one embodiment, the selection of the general structure and parameter of a signature scan system can according to one or Multiple following factors:The fixed length feature subcode of the sweep speed of character string condition code, fixed length condition code or random length condition code Size, the similitude between multiple condition code or feature subcodes, and the size of condition code database select, to ensure that fingerprint is swept Specific scanning system can be met by retouching engine 140, fixed length condition code Lookup engine 160, and random length condition code Lookup engine 180 Requirement.For example, scanning step can select according to system requirements.As shown in table 1, scanning step is longer, quick character string The speed of signature scan engine 100 is faster.But the minimum dimension of fixed length condition code or feature subcode just need it is bigger, with And the number of insertion and deletion is more.The step-length that exposes thoroughly also limits the selection of the fingerprint of each condition code, and increases fingerprint and touch Hit the probability with the matching of fingerprint false positive.
Table 1:The selection of scanning step
In addition, scanning step can especially be limited so as to sweep speed by the size of minimum fixed length condition code or feature subcode. In one embodiment, in order to avoid short condition code is separately scanned, scanning step can be by table 1 according to fixed length condition code or spy The minimum dimension for levying subcode selects.
Table 1 assumes that each elementary cell of all fixed length condition codes and feature subcode can be with for fingerprint.In a reality Apply in example, all fixed length condition codes and feature subcode at least indicate completely in a shadow space.In addition, at another In embodiment, scanning step is so as to which sweep speed is further by the shadow indicated completely of all fixed length condition codes and feature subcode Minimum dimension limitation.Therefore, the 3rd column heading in table 1 can make into " all fixed length condition code or feature subcode it is complete All referring to the minimum dimension of bright shadow ".
In another embodiment, in order to improve sweep speed, larger scanning step can be selected.It is shorter than the scanning The condition code that step-length can scan can be scanned separately, for example, with above-mentioned scan method or any other scan method.When only few Number fixed length condition code or feature subcode are shorter, and increase scanning step is very effective.
In another embodiment, the engine number of different scanning flow line stage can be different.Engine can be according to institute The requirement of particular system is stated to select.For example, for particular system, a scanning pre-processing engine 120, four fingerprints can be used Scanning engine 140, a fixed length condition code Lookup engine 160, and two random length condition code Lookup engines 180.
In one embodiment, multiple finger scan engines 140 can be used, so that each finger scan engine 140 One group of fingerprint length is covered, to provide multiresolution finger scan.In one embodiment, all fingerprints are all broken down into phase Scanned with the fingerprint section of length and all fingerprint Duan Douyong identicals scanning step.For scanning all fingerprint length groups The number of the finger scan engine of every group of fingerprint length can be all identical.
In another embodiment, fingerprint is broken down into the fingerprint section of different length, and the fingerprint section of different length is according to one The average length of group fingerprint length is scanned with the scanning step of different length, so that the finger that one group of average length is longer Fingerprint longer fingerprint section and the larger scanning step of line length, the fingerprint of the shorter fingerprint length of one group of average length are used Shorter fingerprint section and less scanning step.For example, the fingerprint section and scanning step of 8 elementary cells can be used for length is 8,16,24,32, and the fingerprint of 40 elementary cells, and the fingerprint section of 2 elementary cells and scanning step can be used for length is 2,4, and the fingerprint of 6 elementary cells.
In one embodiment, drawn to balance with the finger scan of the fingerprint section of different scanning step scan different length 140 speed is held up, the shorter scanning step scanning of use can be more than compared with the number of the finger scan engine 140 of brachydactylia line section is swept with longer Retouch the number of the finger scan engine 140 of the longer fingerprint section of step scan.In another embodiment, it is not synchronized in order to balance use The speed of the finger scan engine 140 of the internal memory of degree, the number of the finger scan engine 140 of the slower internal memory of use can be more than with very fast The number of the finger scan engine 140 of internal memory.In general, in another embodiment, the number of finger scan engine 140 can Depending on by the product of scanning step and memory speed.The finger scan engine of the product of smaller scanning step and memory speed 140 number can be more than the number of the finger scan engine 140 of the product of larger scanning step and memory speed.
In one embodiment, the finger scan engine 140 of same group of fingerprint of multiple scanning step identicals scannings covers Multiple nonoverlapping positions staggeredly in one input String field, so that the multiple finger scan engine is always swept Retouch the number and the product of the scanning step of former single finger scan engine that step-length is finger scan engine.For example, in order to provide Identical sweep speed, scanning step are that the number of the finger scan engine of 2 elementary cells can be that scanning step is 8 bases 4 times of the number of the finger scan engine of this unit.In another embodiment, multiple scanning step identicals scan same group The finger scan engine 140 of fingerprint covers the overlapping position staggeredly of a some inputted in String field, so that The total scanning step for obtaining the multiple finger scan engine is more than the scanning step of former single finger scan engine, but is swept than fingerprint The product for retouching the number of engine and the scanning step of former single finger scan engine is small.
In one embodiment, the fingerprint database 148 of different fingerprint segment length is storable in the internal memory of friction speed, So that the internal memory for shorter fingerprint section is faster than the internal memory for longer fingerprint section.In one embodiment, it is different Fixed length condition code database 166 corresponding to fingerprint length group is storable in the internal memory of friction speed, so that for one The internal memory of the shorter fingerprint of group average length is faster than the internal memory of the fingerprint longer for one group of average length.
In one embodiment, the fingerprint database 148 for being shorter than the fingerprint of length-specific (for example, 9 elementary cells) can It is stored in one of most fast internal memory (for example, on-chip memory or CPU caching) of scanning system.In one embodiment, Be shorter than all or part of the fixed length condition code database 166 corresponding to the fingerprint of the length-specific also can and fingerprint database 148 are collectively stored in one of most fast internal memory of scanning system.In another embodiment, multiple fingerprints of same fingerprint group Scanning engine can share a fingerprint database being stored in the internal memory of a multiport 148.
In one embodiment, one or more engines discussed above in different flow line stages can be any Other scan methods are replaced.For example, in one embodiment, a content adressable memory(CAM)Finger scan can be used as Engine 140 is used to scan the shadow of fingerprint, and fixed length condition code Lookup engine 160 and random length condition code Lookup engine 180 It still can be used to further scanning feature code.In another embodiment, CAM can be used as finger scan engine 140 by with Come in former spacescan one or more fingerprint.In one embodiment, one determine or non-deterministic finite automaton (DFA or NFA) synthesis fingerprint section can be used to as fingerprint synthesizer 150.In another embodiment, a DFA or NFA can make A fixed length feature subcode, which is used to, for random length condition code Lookup engine 180 synthesizes random length condition code.
Other embodiments of the invention can scan other string datas.For example, in biosystem, a hereditary generation Code sequence can be used as a String field.Describing the condition code of specific gene can be used for forming from one by genetic data String field in identify the specific gene sequence.For example, specific gene can be with the scanning machine by specific special Code is levied to identify.
The present invention and all feature operations described by this specification, including structural approach or phase described by this specification When structural approach, or both combination, can use and Fundamental Digital Circuit or be implemented with computer software, firmware, or hardware. The present invention can be implemented on one or more computer program products, i.e., one or more be stored in one such as one it is machine readable Storage device or a transmitting signal information carrier in, a such as programmable processor, one or multiple stage computers Digital processing device is performed or the computer program that controls it to operate.One computer program(An also referred to as program, it is soft Part, application software, or code)Any programming language form for including compiling or interpretative code is can be written as, and can be deployed as Including as a stand-alone program or as a module, component, subprogram, or other units suitable for computing environment Any form.One computer program might not correspond to a file.One program can be with other programs or data together Deposit in one file, can be a Single-issue program file, can be multiple coordinations file (for example, storage one Or multiple files of multiple modules, subprogram, or partial code).One computer program can be deployed to a computer, The multiple stage computers in same place, or be distributed in the multiple stage computers interconnected with communication network in multiple places and perform.
Processing routine described by this specification and logic flow, include the method and step of the present invention, can use and perform one Or one or more programmable processors of multiple computer programs perform, with by operate input data and produce output come Implement the function of the present invention.The logic circuit of specific use, such as field programmable gate array (FPGA) or application specific integrated circuit (ASIC) it, may also be used for performing the processing procedure and logic flow and implement the device of the present invention.
Being adapted for carrying out the processor of computer program includes, such as microprocessor of general and special purposes, and any Any one or more processors of the digital computer of species.In general, processor will from read-only storage or with Machine access memory or both receives instruction and data.The basic module of one computer is a processing for being used for execute instruction Device and the memory of one or more store instructions and data.In general, a computer also includes or is efficiently couple to One or more is used for the mass-memory unit (e.g., disk, magneto-optic disk or CD) of data storage, to receive, sends, or receive Send out data.Include the non-volatility memorizer of all formats suitable for the information carrier of storage computer program instructions and data, Including such as EPROM, EEPROM, the semiconductor memory of quick flashing (flash) memory;Such as internal hard drive or the magnetic of moveable magnetic disc Disk;Magneto-optic disk;With CD-ROM and DVD-ROM CDs etc..Processor and memory can be mended by the logic circuit of specific use Fill or be included into the logic circuit of specific use.
In order to provide the interaction with user, the present invention can have display device at one, and a keyboard and a pointer are set Implement on standby computer.Display device, such as a cathode-ray tube(CRT)Or liquid crystal display(LCD)For showing for user Show information;Keyboard and pointing device, such as mouse or trace ball, then user is helped to provide input directly in computer.It is other types of Equipment can also be used for providing the interaction of user and computer;For example, it can be any form that computer, which is supplied to the feedback of user, Sense feedback, for example, visual feedback, audio feedback, or touch feedback;The form that user is input to computer can also be to appoint What form, including acoustics, voice or sense of touch.
The present invention can be implemented in a computer system, after the system includes one such as one data server Component is held, or includes the intermediate module (Middleware) of a such as application server, or is had comprising one such as one The client computer of one graphic user interface or by its user can be interactive with the implementation of the present invention network (Web) it is clear Look at the front end assemblies of device, or any this kind of rear end, it is middle, or the combination of front end assemblies.The component of the system can be by such as The digital data communications of any form or medium of communication network is connected with each other.The example of communication network includes LAN(LAN) And wide area network(WAN), such as internet.
This computing system can include client and server.Client and server is typically remote from other side, generally passes through One communication network interaction.The relation of client and server result from run respectively on respective computer have client and The calculation procedure of service relation.
Fig. 9 demonstrates an example of such computer, a device that can be used for being practiced or carried out the present invention Or the programmable processing system of method(Abbreviation system) 910 block diagram.The system 910 includes a processor 920, and one Individual random access memory(RAM)921, a read-only storage 922(As such as the writeable of flash ROM read-only is deposited Reservoir(ROM)), a hdd controller 923, an image controller 931, and an input/output(I/O)Control Device 924 processed, and by processor(CPU)Bus 925 couples.The system 910 can be programmed such as in ROM, can also be used another One program source(Such as floppy disk, CD-ROM, or another computer)A program is loaded so as to be programmed(Be reprogrammed).
Hdd controller 923 and hard disk 930, which are coupled, can be used for storing executable computer program.
The controller 924 of input/output is to be connected to an input/output interface by an input/output bus 926 927.Input/output interface 927 is in serial link, LAN, on wireless connection and parallel link etc. communication link, receive and Transmit analog or digital data(Such as one group of stage photo, picture, film and animation).
One display 928 and a keyboard 929 are also connected on input/output bus 926.In addition, different line (Different buses)It can be used for connecting input/output interface 927, display 928 and keyboard 929.
The present invention is described in a manner of specific embodiment.Other embodiments the scope of the appended claims it It is interior.For example, the step of invention can perform in differing order, and remain to reach desired result.

Claims (92)

1. a kind of character string signature scan method, methods described include:
One or more random length condition codes are processed into one or more forms, the processing is included one or more indefinite Each random length condition code in long condition code resolves into one or more fixed length feature subcodes and one or more random lengths are special Subcode is levied, and it is one or more for one or more fixed length feature subcodes structure of one or more of random length condition codes For searching the data structure of fixed length feature subcode and being used for one or more of random length condition codes structure is one or more In the data structure for searching random length condition code, wherein one or more of data structures for being used to search random length condition code It is signified by one or more of data structures for being used to search fixed length feature subcode;
Receive a String field being made up of data value;
On each of the scanning positions the word is scanned with one or more of data structures for being used to search fixed length feature subcode String field is accorded with to search one or more fixed length feature subcodes of one or more of random length condition codes;
In the scan position for having one or more fixed length feature subcodes matched with by it is one or more of matched determine The signified one or more of long feature subcode be used to searching random length condition code data structure check the String field with Search the one or more random length condition codes related to one or more of fixed length feature subcodes matched;With
Export the random length condition code of any matching in the String field.
2. method according to claim 1, one or more random length features of one or more of random length condition codes Subcode not only includes any base unit of random length.
3. method according to claim 1, described to be used to search fixed length with one or more of on each of the scanning positions The data structure of feature subcode scan the String field with search one of one or more of random length condition codes or Multiple fixed length feature subcodes are implemented in former space.
4. method according to claim 2, described to be used to search fixed length with one or more of on each of the scanning positions The data structure of feature subcode scan the String field with search one of one or more of random length condition codes or Multiple fixed length feature subcodes are implemented in former space.
5. method according to claim 1, described to be used to search fixed length with one or more of on each of the scanning positions The data structure of feature subcode scan the String field with search one of one or more of random length condition codes or Multiple fixed length feature subcodes, which are included in one or more shadow spaces, to be implemented, wherein the shadow space is by introducing ambiguity It is wider than former space to former space, its form, so that the shadow pair of a fixed length feature subcode in the shadow space Should be in fixed length feature subcode of the one or more in the former space.
6. method according to claim 2, described to be used to search fixed length with one or more of on each of the scanning positions The data structure of feature subcode scan the String field with search one of one or more of random length condition codes or Multiple fixed length feature subcodes, which are included in one or more shadow spaces, to be implemented, wherein the shadow space is by introducing ambiguity It is wider than former space to former space, its form, so that the shadow pair of a fixed length feature subcode in the shadow space Should be in fixed length feature subcode of the one or more in the former space.
7. method according to claim 5, described to be used to search fixed length with one or more of on each of the scanning positions The data structure of feature subcode scan the String field with search one of one or more of random length condition codes or Multiple fixed length feature subcodes further comprise corresponding to the shadow in the fixed length feature subcode matched described in the verification of former space Fixed length feature subcode.
8. method according to claim 6, described to be used to search fixed length with one or more of on each of the scanning positions The data structure of feature subcode scan the String field with search one of one or more of random length condition codes or Multiple fixed length feature subcodes further comprise corresponding to the shadow in the fixed length feature subcode matched described in the verification of former space Fixed length feature subcode.
9. method according to claim 1, one or more fixed length features of one or more of random length condition codes Code is the data value indicated completely entirely.
10. method according to claim 2, one or more fixed length features of one or more of random length condition codes Code is the data value indicated completely entirely.
11. method according to claim 3, one or more fixed length features of one or more of random length condition codes Code is the data value indicated completely entirely.
12. method according to claim 4, one or more fixed length features of one or more of random length condition codes Code is the data value indicated completely entirely.
13. method according to claim 1, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
14. method according to claim 2, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
15. method according to claim 3, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
16. method according to claim 4, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
17. method according to claim 5, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
18. method according to claim 6, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
19. method according to claim 7, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
20. method according to claim 8, one or more random length features of one or more of random length condition codes Subcode only character containing random length.
21. method according to claim 1, one or more fixed length features of one or more of random length condition codes Code indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity to former sky Between, its form is wider than former space, so that the shadow of a fixed length feature subcode in the shadow space corresponds to one Individual or multiple fixed length feature subcodes in the former space.
22. method according to claim 2, one or more fixed length features of one or more of random length condition codes Code indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity to former sky Between, its form is wider than former space, so that the shadow of a fixed length feature subcode in the shadow space corresponds to one Individual or multiple fixed length feature subcodes in the former space.
23. method according to claim 5, one or more fixed length features of one or more of random length condition codes Code indicates completely in former space or at least one shadow space, wherein the shadow space is arrived by introducing ambiguity Former space, its form are wider than former space, so that the shadow of a fixed length feature subcode in the shadow space is corresponding In fixed length feature subcode of the one or more in the former space.
24. method according to claim 6, one or more fixed length features of one or more of random length condition codes Code indicates completely in former space or at least one shadow space, wherein the shadow space is arrived by introducing ambiguity Former space, its form are wider than former space, so that the shadow of a fixed length feature subcode in the shadow space is corresponding In fixed length feature subcode of the one or more in the former space.
25. method according to claim 7, one or more fixed length features of one or more of random length condition codes Code indicates completely in former space or at least one shadow space, wherein the shadow space is arrived by introducing ambiguity Former space, its form are wider than former space, so that the shadow of a fixed length feature subcode in the shadow space is corresponding In fixed length feature subcode of the one or more in the former space.
26. method according to claim 8, one or more fixed length features of one or more of random length condition codes Code indicates completely in former space or at least one shadow space, wherein the shadow space is arrived by introducing ambiguity Former space, its form are wider than former space, so that the shadow of a fixed length feature subcode in the shadow space is corresponding In fixed length feature subcode of the one or more in the former space.
It is 27. one or more of to be used to search fixed length feature subcode according to one of claim 1-26 methods described Data structure includes one or more hash tables.
28. according to claim 27 methods described, one or more of hash tables further comprise that hashed value is multiplexed.
29. according to claim 27 methods described, one or more of hash tables further comprise that fingerprint length is multiplexed.
It is 30. one or more of to be used to search fixed length feature subcode according to one of claim 1-26 methods described Data structure includes grand (Bloom) filter of one or more cloth.
31. according to claim 30 methods described, one or more of grand (Bloom) filters of cloth further comprise hashed value Multiplexing.
32. according to claim 30 methods described, one or more of grand (Bloom) filters of cloth further comprise that fingerprint is grown Degree multiplexing.
It is 33. one or more of to be used to search fixed length feature subcode according to one of claim 1-26 methods described Data structure includes one or more deterministic stresses (DFA).
It is 34. one or more of to be used to search fixed length feature subcode according to one of claim 1-26 methods described Data structure includes one or more non-deterministic finite automatons (NFA).
It is 35. one or more of to be used to search fixed length feature subcode according to one of claim 1-26 methods described Data structure is including the use of one or more content adressable memorys (CAM).
36. according to one of claim 1-26 methods described, described to check String field same or multiple to search One or more random length condition codes of the fixed length feature subcode correlation matched include the fixed length feature matched described in inspection The positional information of subcode.
37. according to claim 36 methods described, the positional information of the fixed length feature subcode includes the fixed length feature subcode Order in related random length condition code.
38. according to claim 36 methods described, the positional information of the fixed length feature subcode includes the fixed length feature subcode Whether be related random length condition code first fixed length feature subcode.
39. according to claim 36 methods described, the positional information of the fixed length feature subcode includes the fixed length feature subcode Whether be related random length condition code last fixed length feature subcode.
40. according to claim 36 methods described, the positional information of the fixed length feature subcode includes the fixed length feature subcode To the distance or distance range of next fixed length feature subcode.
41. according to one of claim 1-26 methods described, described to check String field same or multiple to search One or more random length condition codes of the fixed length feature subcode correlation matched include the fixed length feature matched described in inspection The numbering of random length condition code corresponding to subcode.
42. according to one of claim 1-26 methods described, described to check String field same or multiple to search One or more random length condition codes of the fixed length feature subcode correlation matched include the state table of renewal random length condition code Lattice.
It is 43. one or more of to be used to search random length condition code according to one of claim 1-26 methods described Data structure includes one or more deterministic stresses (DFA).
It is 44. one or more of to be used to search random length condition code according to one of claim 1-26 methods described Data structure includes one or more non-deterministic finite automatons (NFA).
It is 45. one or more of to be used to search random length condition code according to one of claim 1-26 methods described Data structure is including the use of one or more content adressable memorys (CAM).
46. a kind of character string signature scan system, the system include:
One condition code pretreatment module that one or more condition codes can be processed into one or more forms, the processing bag Include each random length condition code in one or more random length condition codes resolve into one or more fixed length feature subcodes and One or more random length feature subcodes, and be one or more fixed length features of one or more of random length condition codes Subcode structure one or more is for searching the data structure of fixed length feature subcode and being one or more of random length features The one or more data structures for being used to search random length condition code of code structure, wherein one or more of indefinite for searching The data structure of long condition code is signified by one or more of data structures for being used to search fixed length feature subcode;
One can be used to search fixed length feature in each scan position in a String field with one or more of The data structure of subcode identifies the fixed length signature scan engine of one or more fixed length feature subcodes;With
One can be in the scan position for having one or more fixed length feature subcodes matched with one or more one Or the signified data structure identification for being used to search random length condition code of multiple fixed length feature subcodes for having matched with it is one Or the random length condition code Lookup engine of one or more random length condition codes of multiple fixed length feature subcode correlations matched.
47. the system according to claim 46, one or more random lengths of one or more of random length condition codes are special Sign subcode not only includes any base unit of random length.
48. the system according to claim 46, the fixed length signature scan engine is in character string word described in former spacescan Section is to search one or more fixed length feature subcodes of one or more of random length condition codes.
49. the system according to claim 47, the fixed length signature scan engine is in character string word described in former spacescan Section is to search one or more fixed length feature subcodes of one or more of random length condition codes.
50. the system according to claim 46, the fixed length signature scan engine is included in one or more shadow spaces The String field is scanned to search one or more fixed length feature subcodes of one or more of random length condition codes, its Described in shadow space it is wider than former space to former space, its form by introducing ambiguity so that one in the shadow The shadow of the fixed length feature subcode of subspace corresponds to one or more fixed length feature subcodes in the former space.
51. the system according to claim 47, the fixed length signature scan engine is included in one or more shadow spaces The String field is scanned to search one or more fixed length feature subcodes of one or more of random length condition codes, its Described in shadow space it is wider than former space to former space, its form by introducing ambiguity so that one in the shadow The shadow of the fixed length feature subcode of subspace corresponds to one or more fixed length feature subcodes in the former space.
52. the system according to claim 50, the fixed length signature scan engine further comprises examining institute in former space State the fixed length feature subcode corresponding to the shadow of the fixed length feature subcode matched.
53. the system according to claim 51, the fixed length signature scan engine further comprises examining institute in former space State the fixed length feature subcode corresponding to the shadow of the fixed length feature subcode matched.
54. the system according to claim 46, one or more fixed length features of one or more of random length condition codes Subcode is the data value indicated completely entirely.
55. the system according to claim 47, one or more fixed length features of one or more of random length condition codes Subcode is the data value indicated completely entirely.
56. the system according to claim 48, one or more fixed length features of one or more of random length condition codes Subcode is the data value indicated completely entirely.
57. the system according to claim 49, one or more fixed length features of one or more of random length condition codes Subcode is the data value indicated completely entirely.
58. the system according to claim 46, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
59. the system according to claim 47, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
60. the system according to claim 48, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
61. the system according to claim 49, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
62. the system according to claim 50, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
63. the system according to claim 51, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
64. the system according to claim 52, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
65. the system according to claim 53, one or more random lengths of one or more of random length condition codes are special Levy subcode only character containing random length.
66. the system according to claim 46, one or more fixed length features of one or more of random length condition codes Subcode indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity to original Space, its form are wider than former space, so that the shadow of a fixed length feature subcode in the shadow space corresponds to One or more fixed length feature subcodes in the former space.
67. the system according to claim 47, one or more fixed length features of one or more of random length condition codes Subcode indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity to original Space, its form are wider than former space, so that the shadow of a fixed length feature subcode in the shadow space corresponds to One or more fixed length feature subcodes in the former space.
68. the system according to claim 50, one or more fixed length features of one or more of random length condition codes Subcode indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity It is wider than former space to former space, its form, so that the shadow pair of a fixed length feature subcode in the shadow space Should be in fixed length feature subcode of the one or more in the former space.
69. the system according to claim 51, one or more fixed length features of one or more of random length condition codes Subcode indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity It is wider than former space to former space, its form, so that the shadow pair of a fixed length feature subcode in the shadow space Should be in fixed length feature subcode of the one or more in the former space.
70. the system according to claim 52, one or more fixed length features of one or more of random length condition codes Subcode indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity It is wider than former space to former space, its form, so that the shadow pair of a fixed length feature subcode in the shadow space Should be in fixed length feature subcode of the one or more in the former space.
71. the system according to claim 53, one or more fixed length features of one or more of random length condition codes Subcode indicates completely in former space or at least one shadow space, wherein the shadow space is by introducing ambiguity It is wider than former space to former space, its form, so that the shadow pair of a fixed length feature subcode in the shadow space Should be in fixed length feature subcode of the one or more in the former space.
72. according to one of the claim 46-71 systems, the fixed length signature scan engine including the use of one or Multiple hash tables.
73. the system according to claim 72, one or more of hash tables further comprise that hashed value is multiplexed.
74. the system according to claim 72, one or more of hash tables further comprise that fingerprint length is multiplexed.
75. according to one of the claim 46-71 systems, the fixed length signature scan engine including the use of one or Multiple grand (Bloom) filters of cloth.
76. the system according to claim 75, one or more of grand (Bloom) filters of cloth further comprise hashed value Multiplexing.
77. the system according to claim 75, one or more of grand (Bloom) filters of cloth further comprise that fingerprint is grown Degree multiplexing.
78. according to one of the claim 46-71 systems, the fixed length signature scan engine including the use of one or Multiple deterministic finite automatons (DFA).
79. according to one of the claim 46-71 systems, the fixed length signature scan engine including the use of one or Multiple non-deterministic finite automatons (NFA).
80. according to one of the claim 46-71 systems, the fixed length signature scan engine including the use of one or Multiple content adressable memorys (CAM).
81. according to one of the claim 46-71 systems, the random length condition code Lookup engine is including the use of The positional information for the fixed length feature subcode matched somebody with somebody.
82. the system according to claim 81, the positional information of the fixed length feature subcode includes the fixed length feature subcode Order in related random length condition code.
83. the system according to claim 81, the positional information of the fixed length feature subcode includes the fixed length feature subcode Whether be related random length condition code first fixed length feature subcode.
84. the system according to claim 81, the positional information of the fixed length feature subcode includes the fixed length feature subcode Whether be related random length condition code last fixed length feature subcode.
85. the system according to claim 81, the positional information of the fixed length feature subcode includes the fixed length feature subcode To the distance or distance range of next fixed length feature subcode.
86. according to one of the claim 46-71 systems, the random length condition code Lookup engine is including the use of described The numbering of the random length condition code corresponding to fixed length feature subcode matched.
87. according to one of the claim 46-71 systems, it is indefinite that the random length condition code Lookup engine includes renewal The state form of long condition code.
88. according to one of the claim 46-71 systems, the random length condition code Lookup engine is including the use of one Or multiple deterministic finite automatons (DFA).
89. according to one of the claim 46-71 systems, the random length condition code Lookup engine is including the use of one Or multiple nondeterministic finite automatons (NFA).
90. according to one of the claim 46-71 systems, the random length condition code Lookup engine is including the use of one Or multiple content adressable memorys (CAM).
91. according to one of the claim 46-71 systems, the number and random length of the fixed length condition code Lookup engine The number of condition code Lookup engine is identical.
92. according to one of the claim 46-71 systems, the number and random length of the fixed length condition code Lookup engine The number of condition code Lookup engine is different.
CN201410055830.4A 2008-10-20 2008-10-20 Fast signature scan Expired - Fee Related CN103793522B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410055830.4A CN103793522B (en) 2008-10-20 2008-10-20 Fast signature scan
CN201711339378.4A CN108197470A (en) 2008-10-20 2008-10-20 Fast signature scan

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410055830.4A CN103793522B (en) 2008-10-20 2008-10-20 Fast signature scan
CN200880127748.0A CN101960469B (en) 2008-10-20 2008-10-20 Quick signature scan

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN200880127748.0A Division CN101960469B (en) 2008-10-20 2008-10-20 Quick signature scan

Publications (2)

Publication Number Publication Date
CN103793522A CN103793522A (en) 2014-05-14
CN103793522B true CN103793522B (en) 2018-01-12

Family

ID=50669188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410055830.4A Expired - Fee Related CN103793522B (en) 2008-10-20 2008-10-20 Fast signature scan

Country Status (1)

Country Link
CN (1) CN103793522B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095809B (en) * 2016-05-30 2020-02-07 凯通科技股份有限公司 Data matching method and system
CN108614827A (en) * 2016-12-12 2018-10-02 阿里巴巴集团控股有限公司 Data segmentation method, judging method and electronic equipment
CN107609874B (en) * 2017-10-09 2020-06-30 恒宝股份有限公司 Transaction log data verification method and verification system
CN117709298B (en) * 2024-02-05 2024-05-07 中国电子信息产业集团有限公司第六研究所 Double character stream scanning method, electronic equipment, storage medium and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101140593A (en) * 2007-10-11 2008-03-12 中国科学院计算技术研究所 A keyword matching method and system
CN101243441A (en) * 2005-06-21 2008-08-13 国际字符股份有限公司 Method and apparatus for processing character streams

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7636703B2 (en) * 2006-05-02 2009-12-22 Exegy Incorporated Method and apparatus for approximate pattern matching

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101243441A (en) * 2005-06-21 2008-08-13 国际字符股份有限公司 Method and apparatus for processing character streams
CN101140593A (en) * 2007-10-11 2008-03-12 中国科学院计算技术研究所 A keyword matching method and system

Also Published As

Publication number Publication date
CN103793522A (en) 2014-05-14

Similar Documents

Publication Publication Date Title
US10949641B2 (en) Fast signature scan
Yang et al. Pase: Postgresql ultra-high-dimensional approximate nearest neighbor search extension
US9805080B2 (en) Data driven relational algorithm formation for execution against big data
US6892207B2 (en) Method of updating data in a compressed data structure
US8156156B2 (en) Method of structuring and compressing labeled trees of arbitrary degree and shape
US20070198566A1 (en) Method and apparatus for efficient storage of hierarchical signal names
CN106797446B (en) Memory-based history search
US7634470B2 (en) Efficient searching techniques
US20140330861A1 (en) Fast identification of complex strings in a data stream
JPH08508123A (en) Language recognition collation system
US8082233B2 (en) Comparing data sets through identification of matching blocks
CN103793522B (en) Fast signature scan
CN108197470A (en) Fast signature scan
CN111813744A (en) File searching method, device, equipment and storage medium
Hadzic et al. UNI3-efficient algorithm for mining unordered induced subtrees using TMG candidate generation
US20080306948A1 (en) String and binary data sorting
US20080243840A1 (en) Comparing data sets through identification of matching blocks
Ferragina Dynamic text indexing under string updates
US12298952B1 (en) Multiple pass sort with subset splitting
JP3062119B2 (en) Character string search table, method for creating the same, and character string search method
de Carvalho Junior Sequence alignment algorithms
Roy et al. Parallel interval stabbing on the automata processor
CN104484381B (en) For searching for the method and system of multiple strings
Kujala Assembling approximately optimal binary search trees efficiently using arithmetics
AU2002241912A1 (en) Efficient searching techniques

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: WANG YING

Free format text: FORMER OWNER: WANG QIANG

Effective date: 20141115

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; TO: 200000 YANGPU DISTRICT, SHANGHAI

TA01 Transfer of patent application right

Effective date of registration: 20141115

Address after: 902 room 83, Lane 289, 200000 souvenir Road, Shanghai

Applicant after: Wang Ying

Address before: California, USA

Applicant before: Wang Qiang

GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180112

CF01 Termination of patent right due to non-payment of annual fee