CN114138279B - Fingerprint feature generation method and matching method for code segment

Info

Publication number: CN114138279B
Application number: CN202111449816.9A
Authority: CN
Inventors: 杨钦; 余浩翔; 许渊聪
Original assignee: Shanghai Anshi Information Technology Co ltd
Current assignee: Shanghai Anshi Information Technology Co ltd
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2024-10-18
Anticipated expiration: 2041-11-30
Also published as: CN114138279A

Abstract

The application discloses a fingerprint feature generation method and a matching method for a code segment. The generation method comprises the following steps: acquiring the source code of the code segment; performing code cleaning on the source code to obtain a continuous character string carrying code line number information; sliding a first window of preset character length along the continuous character string, one character at a time, to select character string fragments; obtaining a fixed-length code for each character string fragment to obtain a plurality of second fixed-length codes; sliding a second window covering a preset number of fixed-length codes along the second fixed-length codes, one code at a time, to select fixed-length code sets; screening a third fixed-length code from each fixed-length code set to obtain a plurality of third fixed-length codes; and taking the plurality of third fixed-length codes as fingerprint features of the code segment. The source code is thereby broken down into small pieces; the character string fragments are represented by fixed-length codes, which reduces the data dimension and the amount of subsequent matching; and the fixed-length codes are screened to further reduce the amount of subsequent matching.

Description

Fingerprint feature generation method and matching method for code segment

Technical Field

The application relates to the field of software analysis, in particular to a fingerprint feature generation method and a matching method of a code segment.

Background

The use of open source components has greatly facilitated software development, and the number of software products released in recent years has increased accordingly.

When an enterprise uses a software system developed by an outsourced vendor, it needs to determine, for the sake of system security, which open source components are used in the software. Enterprises typically use software composition analysis tools to analyze software.

At present, mainstream software composition analysis is based on code fingerprints: the software code is encoded with MD5, and the source code and component details in the major open source repositories are likewise encoded with MD5. The MD5 codes of the software are then compared with the MD5 codes of the open source code. If they are consistent, the corresponding open source component is used in the software.

In view of the related art described above, the inventors consider that software composition analysis based on code fingerprints essentially works line by line. However, during software development, open source components are often adapted, and once a single letter is modified, the code of the whole line changes. The line-level codes of the open source component therefore change as well, the matching results become inconsistent, and the software composition analysis becomes inaccurate.

Disclosure of Invention

In order to improve analysis accuracy, the application provides a fingerprint feature generation method and a matching method of a code segment.

In a first aspect, the present application provides a method for generating fingerprint features of a code segment, which adopts the following technical scheme:

a fingerprint feature generation method of a code segment, comprising the steps of:

Acquiring source codes of the code fragments;

Performing code cleaning on the source code to obtain a continuous character string carrying code line number information, wherein the code cleaning comprises at least one of the following steps: removing line feed, removing blank space and removing annotation information;

Sliding a first window of preset character length along the continuous character string, one character at a time, to select character string fragments, thereby obtaining a plurality of consecutive character string fragments, and determining the end code line number corresponding to each character string fragment;

Obtaining a fixed-length code for each character string fragment, thereby obtaining a plurality of second fixed-length codes in one-to-one correspondence with the plurality of character string fragments;

Sliding a second window covering a preset number of fixed-length codes along the second fixed-length codes, one code at a time, to select fixed-length code sets, thereby obtaining a plurality of consecutive fixed-length code sets;

Screening a third fixed-length code from each fixed-length code set to obtain a plurality of third fixed-length codes, and determining the end code line number of the character string fragment corresponding to each third fixed-length code;

Taking the third fixed-length codes and the end code line number of each third fixed-length code as fingerprint features of the code segment.

By adopting the above technical scheme, the source code of the code segment is split into character string fragments, breaking the whole into small pieces; the character string fragments are replaced by fixed-length codes, which reduces the dimension and hence the amount of subsequent matching; and the fixed-length codes are screened to further reduce the amount of subsequent matching. In addition, each screened fixed-length code is associated with an end code line number, so that similar code can be traced through the end code line numbers once matching is complete.

Optionally, the method further comprises the following steps:

acquiring a first fixed-length code of the source code;

And taking the first fixed-length code, the plurality of third fixed-length codes and the ending code line number of each third fixed-length code as fingerprint characteristics of the code segment.

By adopting the technical scheme, the first fixed-length code is used for representing the whole source code, is suitable for quick matching among the source codes, and is complementary with the segment matching mode corresponding to the third fixed-length code.

Optionally, screening the third fixed-length codes from each fixed-length code set to obtain a plurality of third fixed-length codes, including the following steps:

screening one fixed-length code from each fixed-length code set as a third fixed-length code;

judging whether the adjacent third fixed-length codes are the same,

If the adjacent third fixed-length codes are the same, only the third fixed-length code on the rightmost side is reserved.

By adopting the above technical scheme: because of the way the fixed-length code sets are generated, adjacent fixed-length code sets overlap heavily, so the third fixed-length codes screened from them are often the same code. Repeated third fixed-length codes carry no additional information, so only one needs to be kept; this deduplicates the third fixed-length codes and reduces data bloat.

Optionally, acquiring a storage path of a source code of the code segment;

The storage path of the source code of the code segment is added to each third fixed length code of the respective code segment.

By adopting the above technical scheme, the storage path is recorded in each third fixed-length code to facilitate tracing of code segments; the location of the corresponding code segment can be identified from the storage path in a matched third fixed-length code.

In a second aspect, the present application provides a fingerprint feature matching method for a code segment, which adopts the following technical scheme:

A fingerprint feature matching method of a code segment comprises the following steps:

Extracting fingerprint features of the code segment based on the above fingerprint feature generation method, and matching similar code segments for the code segment based on the fingerprint features.

By adopting the above technical scheme, fingerprint features are also extracted from the various existing codes by the above method and are matched against the fingerprint features of the code segment, so that the existing codes used by the code segment can be identified and the code segment can be traced with higher accuracy.

Optionally, matching the code segments with similar code segments based on fingerprint features includes the steps of:

Establishing a result set corresponding to the existing code;

Traversing a third fixed-length code in the fingerprint characteristics of the code segment to match a fourth fixed-length code of a preset code corresponding to the third fixed-length code from fourth fixed-length codes stored in a preset database;

Determining whether the corresponding fourth fixed-length code is stored in the corresponding result set,

If the corresponding fourth fixed-length code is not stored in the result set, copying the corresponding fourth fixed-length code into the result set;

Similar code segments are determined based on the fourth fixed-length code in the result set.

By adopting the above technical scheme: the existing codes are open source codes or other codes known to the user. A code segment may be assembled from fragments of several existing codes, so fingerprint features of several different existing codes may be matched during matching. To keep them apart, result sets corresponding one-to-one to the existing codes are established in advance to store the matched fourth fixed-length codes of the corresponding existing codes.

Optionally, determining whether the corresponding result set stores the corresponding fourth fixed-length code, further includes the following steps:

if the corresponding fourth fixed-length code is not stored in the result set, copying the corresponding fourth fixed-length code into the result set, and adding to the corresponding fourth fixed-length code a hit number initialized to one;

if the corresponding fourth fixed-length code is stored in the preset result set, adding one to the hit number in the corresponding fourth fixed-length code;

Determining similar code segments based on the fourth fixed-length code in the result set, further comprising the steps of:

Code information of similar code segments is determined based on fourth fixed-length codes in the result set, and the code information is ordered according to the hit times.

By adopting the above technical scheme, the fingerprint features of a code segment may match the same fourth fixed-length code multiple times, and the hit number records how many times it was matched, for the user's reference; the higher the hit number, the more the code corresponding to the fourth fixed-length code is used in the code segment, that is, the higher the coupling.

Optionally, determining similar code segments based on the fourth fixed-length codes in the result set includes the steps of:

Counting the total length of the fourth fixed-length codes in all the result sets, judging whether the total length exceeds the first preset length,

If the total length exceeds the first preset length, filtering all fourth fixed-length codes based on a preset first filtering rule, and storing the filtered fourth fixed-length codes into a preset candidate set;

If the total length is lower than or equal to the first preset length, respectively filtering the fourth fixed-length codes in each result set based on a preset second filtering rule, and storing the filtered fourth fixed-length codes in a preset candidate set;

and determining the corresponding code segment according to the fourth fixed-length codes in the candidate set.

By adopting the above technical scheme, the first preset length provides a preliminary judgment of the quantity of fourth fixed-length codes and thereby determines which filtering rule is used. This makes the filtering more targeted, reduces the number of fourth fixed-length codes participating in matching while affecting matching accuracy as little as possible, and thus reduces the amount of computation.

Optionally, the fourth fixed-length code has the code line number of the corresponding character string segment;

filtering all fourth fixed-length codes based on a preset first filtering rule, and storing the filtered fourth fixed-length codes into a preset candidate set, wherein the method comprises the following steps of:

judging whether the number of code lines of the fourth fixed-length code exceeds a preset minimum number of code lines, judging whether the hit number of the corresponding fourth fixed-length code exceeds a preset minimum hit number,

And if the number of code lines of the fourth fixed-length code exceeds the preset minimum number of code lines and/or the hit number of the fourth fixed-length code exceeds the preset minimum hit number, adding the corresponding fourth fixed-length code into the candidate set.

Optionally, the fourth fixed-length code has an end code line number corresponding to the character string segment,

Filtering the fourth fixed-length codes in each result set based on a preset second filtering rule, and storing the filtered fourth fixed-length codes in a preset candidate set, wherein the method comprises the following steps of:

sequentially judging whether the number of the fourth fixed-length codes in the single result set exceeds a preset number,

If the number of the fourth fixed-length codes in the result set exceeds the preset number, continuing to judge whether the total length of the fourth fixed-length codes in the result set exceeds a second preset length;

If the total length of the fourth fixed-length codes in the result set exceeds the second preset length, sequentially judging whether the hit times of the fourth fixed-length codes in the result set exceed the preset times according to the sequence of the line numbers of the end codes;

If the hit number of the fourth fixed-length code exceeds the preset number of times, adding the current fourth fixed-length code to the candidate set, and executing a preset skip instruction to delete a preset number of fourth fixed-length codes that follow the current fourth fixed-length code;

And if the hit number of the fourth fixed-length code is less than or equal to the preset number, adding the current fourth fixed-length code into the candidate set.

In summary, the present application includes at least one of the following beneficial technical effects: fingerprint features representing a code segment are generated through code cleaning, window conversion and fixed-length code screening; a plurality of fourth fixed-length codes are determined through fingerprint feature matching; and from the matched fourth fixed-length codes it can be determined which existing code segments the content of the current code segment relates to, so that accurate matching results are obtained and the ability to find and guard against vulnerabilities is improved.

Drawings

Fig. 1 is a block diagram showing steps of a fingerprint feature generating method according to an embodiment of the present application.

Fig. 2 is a block diagram of steps of a fingerprint feature matching method according to an embodiment of the present application.

Detailed Description

The application is described in further detail below with reference to fig. 1 to 2.

The embodiment of the application discloses a fingerprint feature generation method of a code segment. Referring to fig. 1, the fingerprint feature generation method of the code segment includes the steps of:

S100, acquiring the source code of the code segment.

The source code may be entered manually or retrieved directly from a preset database; any means of obtaining the source code is acceptable.

S200, performing code cleaning on the source code to obtain a continuous character string carrying code line number information.

The code cleaning comprises at least one of: removing line feeds, removing spaces, and removing comments. In addition to these cleaning modes, further items to be removed can be configured by the operator.

The purpose of the code cleaning is to convert the source code into a continuous character string, which facilitates the generation of character string fragments in subsequent steps. However, to preserve the relationship between the subsequently stored character string fragments and the source code, the continuous character string carries code line number information.

For example, suppose the source code is:

"Int
sha
ign"

Then the cleaned continuous character string is "Intshaign", in which the characters "Int" come from line 1, "sha" from line 2, and "ign" from line 3.
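
The cleaning step can be illustrated with a minimal Python sketch. The function name `clean_source` and the rule that comments start with `//` are assumptions made for illustration only; the description does not prescribe a comment syntax, and a real cleaner would need a language-aware lexer.

```python
def clean_source(source: str) -> tuple[str, list[int]]:
    """Return the cleaned continuous string and the 1-based source line of each kept character."""
    chars: list[str] = []
    line_numbers: list[int] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        # Assumption: comments start with "//" and run to the end of the line.
        code = line.split("//", 1)[0]
        for ch in code:
            if ch.isspace():
                continue  # drop spaces and tabs; line feeds are already removed by splitlines()
            chars.append(ch)
            line_numbers.append(line_no)
    return "".join(chars), line_numbers


text, line_numbers = clean_source("Int\nsha\nign")
print(text)          # Intshaign
print(line_numbers)  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Keeping one line number per character is only one possible way to carry the line number information; it makes the end code line number of any later fragment a simple list lookup.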

S300, sliding and selecting character string fragments in the continuous character string one by using a first window with preset character length to obtain a plurality of continuous character string fragments, and determining the end code line number corresponding to each character string fragment.

The preset character length is set manually.

Sliding one by one means that, after the current character string segment is acquired, the first window moves forward by one character along the continuous character string, and one character string segment is acquired at each position of the window, until the character string segment containing the last character of the continuous character string has been acquired.

The end code line number corresponding to a character string segment is the line number, in the source code, of the last character of that character string segment.

Taking the continuous character string "Intshaign" as an example and assuming a preset character length of 5, the character string segments "Intsh", "ntsha", "tshai", "shaig" and "haign" are obtained, and their end code line numbers are 2, 2, 3, 3 and 3 respectively.

In practice, the preset character length is typically set to 30, which reduces the number of cases in which adjacent character string segments correspond to the same line number.
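
Step S300 can be sketched as follows. The function name `string_fragments` and the per-character line number list are illustrative assumptions, not part of the described method; the input reuses the "Int / sha / ign" example.

```python
def string_fragments(text: str, line_numbers: list[int], window: int) -> list[tuple[str, int]]:
    """Slide a window of `window` characters one position at a time.

    Returns (fragment, end code line number) pairs, where the end code line
    number is the source line of the fragment's last character.
    """
    fragments = []
    for start in range(len(text) - window + 1):
        end = start + window  # exclusive end index of the current window
        fragments.append((text[start:end], line_numbers[end - 1]))
    return fragments


text = "Intshaign"
line_numbers = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # source line of each character
for fragment, end_line in string_fragments(text, line_numbers, window=5):
    print(fragment, end_line)
# Intsh 2
# ntsha 2
# tshai 3
# shaig 3
# haign 3
```

In this sketch, a string shorter than the window yields no fragments.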

S400, obtaining fixed-length codes of each character string segment, and obtaining a plurality of second fixed-length codes corresponding to the plurality of character string segments one by one.

Fixed length encoding refers to an encoded value used to characterize a string, such as a message digest or hash value obtained by a hashing algorithm.

In some embodiments of the application, the fixed-length code is preferably obtained using a variational autoencoder (VAE). A variational autoencoder is a neural-network-based generative model that is usually trained by unsupervised learning. A hash algorithm reduces the dimension of the target characters by compression to obtain a fixed-length code value; by contrast, the fixed-length code produced by a variational autoencoder reduces the dimension by generation rather than compression, and the resulting code value carries semantic information. The variational autoencoder is therefore particularly suitable for encoding code segments, which carry semantic information.

The second fixed-length code comprises a code value and an end code line number corresponding to the corresponding character string segment.
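
As a minimal sketch of step S400, the hash-based option mentioned above can be written with Python's standard `hashlib`. This is a stand-in only: the preferred embodiment uses a variational autoencoder, which is not reproduced here. The function names and the choice of a 32-bit value taken from a SHA-256 digest are assumptions for illustration.

```python
import hashlib


def fixed_length_code(fragment: str) -> int:
    """Map a fragment to a 32-bit integer taken from the first 4 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def second_codes(fragments: list[tuple[str, int]]) -> list[tuple[int, int]]:
    """Turn (fragment, end_line) pairs into (code value, end_line) pairs, one per fragment."""
    return [(fixed_length_code(text), end_line) for text, end_line in fragments]


fragments = [("Intsh", 2), ("ntsha", 2), ("tshai", 3), ("shaig", 3), ("haign", 3)]
for code, end_line in second_codes(fragments):
    print(f"{code:08x} ends on line {end_line}")
```

Every fragment maps to a value of the same size regardless of its length, and each second fixed-length code keeps the end code line number of its fragment, as described above.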

S500, sliding and selecting fixed-length code sets in the second fixed-length codes one by one according to a second window with the preset number of the fixed-length codes to obtain a plurality of continuous fixed-length code sets.

Before the fixed-length code sets are selected, the second fixed-length codes are sorted by line number to form a continuous fixed-length code string analogous to the continuous character string.

The fixed-length code sets are obtained in a manner similar to the character string fragments.

Assuming that the length of the second window is 3, i.e. the preset number of fixed-length codes is 3, the continuous fixed-length code string "11, 12, 13, 14, 15, 16, 17" yields the fixed-length code sets "11, 12, 13", "12, 13, 14", "13, 14, 15", "14, 15, 16" and "15, 16, 17".
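
Step S500 can be sketched in the same way as the first window. The function name `code_sets` is an assumption; the example uses a window of three codes, which is the size that reproduces the five three-element sets listed above from the code string 11 to 17.

```python
def code_sets(codes: list[int], window: int) -> list[tuple[int, ...]]:
    """Slide a window of `window` codes one code at a time over the ordered code string."""
    return [tuple(codes[i:i + window]) for i in range(len(codes) - window + 1)]


print(code_sets([11, 12, 13, 14, 15, 16, 17], window=3))
# [(11, 12, 13), (12, 13, 14), (13, 14, 15), (14, 15, 16), (15, 16, 17)]
```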

S600, screening the third fixed-length codes from each fixed-length code set to obtain a plurality of third fixed-length codes, and determining the end code line number of the character string segment corresponding to each third fixed-length code.

The third fixed-length code comprises a fixed-length code and an end code line number corresponding to the fixed-length code, and the fixed-length code of the third fixed-length code is the most representative fixed-length code in the corresponding fixed-length code set.

In one embodiment, the third fixed-length codes are screened from each fixed-length code set to obtain a plurality of third fixed-length codes, and the method comprises the following steps:

one fixed-length code is selected from each fixed-length code set as a third fixed-length code.

Judging whether the adjacent third fixed-length codes are the same,

If the adjacent third fixed-length codes are the same, only the third fixed-length code on the rightmost side is reserved;

If the adjacent third fixed-length codes are not the same, the adjacent third fixed-length codes are reserved.

The fixed-length code may be selected by taking the code with the smallest value in the fixed-length code set, or the code with the largest value, as long as the number of third fixed-length codes is reduced.

In addition, because of the way the fixed-length code sets are generated, the third fixed-length codes screened from adjacent fixed-length code sets will inevitably sometimes be identical. To further reduce the number of third fixed-length codes, when adjacent third fixed-length codes are identical, only one is kept and the others are deleted. The one kept is the rightmost in the moving direction of the second window, which avoids missed deletions when three or more consecutive third fixed-length codes are identical.

In this embodiment, selecting the fixed-length code with the minimum value as the third fixed-length code is taken as an example. Assuming that the fixed-length code sets are (22, 23, 24), (23, 24, 11), (24, 11, 12), (11, 12, 24) and (12, 24, 25), the fixed-length codes screened from the sets are 22, 11, 11, 11 and 12 in turn; since the middle three are the same, only the last 11 is kept. Finally, each code carries its corresponding end code line number, so the third fixed-length codes are represented as (22, 3), (11, 8), (12, 14), where the first value is the fixed-length code and the second value is the corresponding end code line number.
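
The screening in step S600 can be sketched as below, using the minimum-value rule and keeping the rightmost code of each run of identical picks. The function name `screen_third_codes` is an assumption, and so are the end code line numbers of the codes 23, 24 and 25, which the example in the description does not give.

```python
def screen_third_codes(sets: list[list[tuple[int, int]]]) -> list[tuple[int, int]]:
    """Pick the minimum (code, end_line) from each set, then collapse adjacent equal picks.

    When neighbouring picks have the same code value, only the rightmost one is kept.
    """
    picks = [min(code_set, key=lambda item: item[0]) for code_set in sets]
    kept = []
    for i, pick in enumerate(picks):
        is_last_of_run = i == len(picks) - 1 or picks[i + 1][0] != pick[0]
        if is_last_of_run:
            kept.append(pick)
    return kept


# (code, end_line) pairs; the line numbers 5, 6, 15 and 16 are made up for illustration.
second = [(22, 3), (23, 5), (24, 6), (11, 8), (12, 14), (24, 15), (25, 16)]
sets = [second[i:i + 3] for i in range(len(second) - 2)]
print(screen_third_codes(sets))
# [(22, 3), (11, 8), (12, 14)]
```

Selecting one representative per overlapping window is the idea behind winnowing-style document fingerprinting; the description itself also allows other selection rules, such as the maximum value.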

In one embodiment, when the source code of a code segment is obtained, the storage path of the source code of the code segment is also obtained.

The storage path refers to the storage position of the source code, and different source codes correspond to different storage paths, so that after the storage path is added to the third fixed-length code, the corresponding source code can be traced through the storage path in the third fixed-length code.

S700, taking the third fixed-length codes and the end code line number of each third fixed-length code as fingerprint characteristics of the code segment.

In one embodiment, before the source code is cleaned, the method further comprises the following steps:

And acquiring a first fixed length code of the source code, and taking the first fixed length code as one of fingerprint features of the code segment.

The first fixed-length code is likewise a fixed-length code value computed by a variational autoencoder (VAE), with the source code as input. The difference is that, whereas a single third fixed-length code represents only a portion of the source code, the first fixed-length code represents the entire source code.
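
One possible in-memory layout for the resulting fingerprint feature is sketched below. The class and field names are assumptions chosen for illustration, and the path "src/demo.c" is a made-up example; the description only states which pieces of information the fingerprint carries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ThirdCode:
    code: int        # screened fixed-length code value
    end_line: int    # end code line number of the underlying character string fragment
    path: str        # storage path of the source code (optional embodiment)


@dataclass
class Fingerprint:
    third_codes: list[ThirdCode] = field(default_factory=list)
    first_code: Optional[str] = None  # code of the whole source, present only in some embodiments


fingerprint = Fingerprint(
    third_codes=[
        ThirdCode(22, 3, "src/demo.c"),
        ThirdCode(11, 8, "src/demo.c"),
        ThirdCode(12, 14, "src/demo.c"),
    ],
)
print([(t.code, t.end_line) for t in fingerprint.third_codes])
# [(22, 3), (11, 8), (12, 14)]
```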

The embodiment of the application also discloses a fingerprint feature matching method for a code segment, which is applied to tracing the origin of the code segment. The source code of a code segment may use code from several open source components, and the use of open source components must satisfy the corresponding open source licenses.

Different open source components are released under different open source licenses, and different open source licenses impose different usage requirements.

For example, the GPL requires that a software product using an open source component licensed under the GPL must itself be open source and free, whereas the BSD license allows commercial software to be developed, released and sold based on BSD-licensed code.

It is therefore important to know which open source components a piece of source code uses. The application uses the fingerprint feature generation method to generate the fingerprint features of the code segment and of the open source components respectively, and determines the open source components corresponding to the code segment by matching the fingerprint features against each other.

A fingerprint feature matching method of a code segment, see fig. 2, comprising the steps of:

S800, extracting fingerprint features of the code segments based on the fingerprint feature generation method of the code segments.

S900, matching similar code segments for the code segments based on fingerprint characteristics.

The fingerprint features of a code segment include a first fixed-length code and third fixed-length codes, but not every code segment needs to generate both.

For example, if the source code of a code segment is taken directly from existing code without any modification, generating the first fixed-length code alone is enough to match the source code of the code segment to the identical existing code. In an embodiment, the existing code is open source code.

If instead the source code of the code segment is produced by adapting open source code, the third fixed-length codes must be generated before the source code of the code segment to be traced can be matched to similar open source code.

Therefore, to improve efficiency, step S800 is split into:

S810, extracting a first fixed-length code of the code segment based on a fingerprint feature generation method of the code segment.

S820, extracting a third fixed-length code of the code segment based on the fingerprint feature generation method of the code segment.

And splits step S900 into:

S910, matching similar code segments for the code segments based on the first fixed length code.

S920, matching similar code segments for the code segments based on third fixed-length coding.

Step S810 and step S910 are performed before step S820. If similar open source code can be matched through the first fixed-length code in step S910, the subsequent steps S820 and S920 need not be performed. Only when no similar open source code can be matched through the first fixed-length code in step S910 are steps S820 and S920 performed in sequence.
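
The ordering of these steps can be sketched as a two-stage lookup. All names here are assumptions for illustration: `whole_file_code` uses a SHA-256 digest as a stand-in for the first fixed-length code, `whole_file_index` stands in for the preset database of whole-source codes, and `match_by_third_codes` is a placeholder for the fragment-level matching.

```python
import hashlib
from typing import Callable


def whole_file_code(source: str) -> str:
    # Stand-in for the whole-source fixed-length code (the description prefers a VAE).
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def trace(source: str,
          whole_file_index: dict[str, str],
          match_by_third_codes: Callable[[str], list[str]]) -> list[str]:
    """Return names of matching existing code, trying the cheap whole-source match first."""
    first_code = whole_file_code(source)          # S810
    if first_code in whole_file_index:            # S910
        return [whole_file_index[first_code]]     # unmodified copy: skip S820 and S920
    return match_by_third_codes(source)           # S820 followed by S920


index = {whole_file_code("Int\nsha\nign"): "libA/x.c"}
fallback = lambda source: ["(fragment matching would run here)"]
print(trace("Int\nsha\nign", index, fallback))  # ['libA/x.c']
print(trace("Int\nshb\nign", index, fallback))  # ['(fragment matching would run here)']
```

The more expensive fragment-level matching runs only when the whole-source lookup misses, which is the efficiency argument made above.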

In one embodiment, matching similar code segments for the code segment based on the first fixed-length code comprises the following steps:

Matching a fifth fixed-length code corresponding to the first fixed-length code from the fifth fixed-length codes stored in a preset database, and determining similar code segments according to the fifth fixed-length code.

The fifth fixed-length code is a code value computed for the open source code using a variational autoencoder (VAE). It is formed in substantially the same way as the first fixed-length code, except that the first fixed-length code corresponds to the code segment to be traced, whereas the fifth fixed-length code corresponds to an open source component.

In one embodiment, matching similar code segments for the code segments based on a third fixed length code comprises the steps of:

S921, establishing a result set corresponding to the existing code.

The result set is a virtual space for storing data.

One code segment corresponds to one result set.

The purpose of creating the result sets is to store the fingerprint features of the scattered and numerous similar code segments matched through the third fixed-length codes.

S922, traversing the third fixed-length codes of the code segment to match, from the fourth fixed-length codes stored in the preset database, the fourth fixed-length codes of the preset code that correspond to the third fixed-length codes.

The fourth fixed-length code is generated in the same manner as the third fixed-length code, except that the source code acquired in step S100 is open source code. Thus, the fourth fixed-length code also contains the corresponding fixed-length code, the corresponding end code line number, and the storage path of the corresponding open source component.

Finding the fourth fixed-length code corresponding to a third fixed-length code means comparing the fixed-length code in the third fixed-length code with the fixed-length code in each fourth fixed-length code; if the two fixed-length codes are the same, the third fixed-length code and the fourth fixed-length code are considered to correspond.

S923, judging whether the corresponding fourth fixed-length codes are stored in the corresponding result set, and if the corresponding fourth fixed-length codes are not stored in the result set, copying the corresponding fourth fixed-length codes into the result set.

If the corresponding fourth fixed-length code is not stored in the result set, then in addition to copying it into the result set, a hit number initialized to one is added to it.

If the corresponding fourth fixed-length code is stored in the preset result set, the hit number in the corresponding fourth fixed-length code is increased by one.

The hit number is used to indicate the number of corresponding third fixed-length codes to which the same fourth fixed-length code is matched. The higher the hit number, the higher the frequency at which the existing code corresponding to the corresponding fourth fixed-length code is used.
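
Steps S922 and S923 can be sketched with one dictionary per existing code. The function name, the shape of `database` (a mapping from code value to stored end code line numbers and storage paths), and the sample paths are assumptions for illustration.

```python
from collections import defaultdict


def match_third_codes(third_codes, database):
    """Build one result set per existing code and count hits per fourth fixed-length code.

    third_codes: code values from the fingerprint being traced.
    database:    dict mapping a code value to [(end_line, path), ...] entries,
                 standing in for the stored fourth fixed-length codes.
    Returns {path: {(code, end_line): hit_number}}.
    """
    result_sets = defaultdict(dict)
    for code in third_codes:                           # S922: traverse the third codes
        for end_line, path in database.get(code, []):
            entries = result_sets[path]
            key = (code, end_line)
            if key not in entries:                     # S923: not stored yet
                entries[key] = 1                       # copy it with a hit number of one
            else:
                entries[key] += 1                      # already stored: hit number plus one
    return dict(result_sets)


database = {
    11: [(8, "libA/x.c")],
    12: [(14, "libA/x.c"), (40, "libB/y.c")],
}
results = match_third_codes([22, 11, 12, 11], database)
for path, entries in results.items():
    ranked = sorted(entries.items(), key=lambda item: item[1], reverse=True)
    print(path, ranked)
# libA/x.c [((11, 8), 2), ((12, 14), 1)]
# libB/y.c [((12, 40), 1)]
```

Sorting each result set by hit number, as in the usage lines, corresponds to the optional ordering of code information by hit count.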

S924, determining similar code segments based on the fourth fixed-length codes in the result set.

And tracing the corresponding existing codes through the storage paths in the fourth fixed-length codes, and displaying the information of the component names, the licenses, the version numbers, the manufacturers, the included loopholes and the like of the existing codes.

Because the source code of a code segment to be traced is often long, a large number of fourth fixed-length codes are matched. Using all of them to determine similar code segments may give a more accurate result but requires a large amount of computation, so the fourth fixed-length codes need to be filtered appropriately.

In one embodiment, determining similar code segments based on a fourth fixed-length code in the result set includes the steps of:

The first preset length is set manually. In this embodiment, the first preset length is 36000.

When the total length of the fourth fixed-length codes in all the result sets exceeds the first preset length, there are too many fourth fixed-length codes, and they are filtered using the preset first filtering rule.

When the total length of the fourth fixed-length codes in all the result sets does not exceed the first preset length, the number of fourth fixed-length codes is still not small, so they are still filtered, in this case using the preset second filtering rule.

All fourth fixed-length codes are filtered based on a preset first filtering rule, and the filtered fourth fixed-length codes are stored in a preset candidate set, and the method comprises the following steps:

The number of code lines is obtained by subtracting the end code line numbers of adjacent fourth fixed-length codes. For example, if the end code line number of the former fourth fixed-length code is 6 and the end code line number of the latter fourth fixed-length code is 16, the number of code lines corresponding to the latter fourth fixed-length code is 10.
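The subtraction above can be illustrated with a minimal Python sketch. The function name and the `first_start` parameter are hypothetical; in particular, treating the line before the first fourth fixed-length code as line 0 is an assumption, since the description only gives the rule for adjacent codes.

```python
def code_line_counts(end_lines, first_start=0):
    """Number of code lines covered by each fourth fixed-length code.

    end_lines:   end code line numbers in ascending order.
    first_start: assumed end line preceding the first code (hypothetical).
    """
    counts = []
    previous_end = first_start
    for end in end_lines:
        # Lines covered = this end line number minus the previous end line number.
        counts.append(end - previous_end)
        previous_end = end
    return counts
```

With the end code line numbers 6 and 16 from the example, the second entry of the returned list is 16 - 6 = 10.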

The preset minimum code line number and the minimum hit number are set manually. For effective filtering, in this embodiment the preset minimum code line number is 9 and the minimum hit number is 4.

In addition, the judgment criterion is set manually. Under a strict filtering criterion, a fourth fixed-length code is added to the candidate set only when its number of code lines exceeds the preset minimum code line number and its hit number exceeds the preset minimum hit number.

Under a looser filtering criterion, a fourth fixed-length code is added to the candidate set when its number of code lines exceeds the preset minimum code line number or its hit number exceeds the preset minimum hit number.
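The strict and loose criteria of the first filtering rule might be sketched as follows. This is an illustrative Python sketch, not the implementation of the application: the dictionary keys `code_lines` and `hits` and the function names are hypothetical, and "exceeds" is read as strictly greater than.

```python
MIN_CODE_LINES = 9  # preset minimum code line number in this embodiment
MIN_HITS = 4        # preset minimum hit number in this embodiment


def passes_first_rule(code_lines, hits, strict=True,
                      min_lines=MIN_CODE_LINES, min_hits=MIN_HITS):
    """Return True if a fourth fixed-length code should enter the candidate set."""
    lines_ok = code_lines > min_lines
    hits_ok = hits > min_hits
    # Strict criterion: both conditions; loose criterion: either condition.
    return (lines_ok and hits_ok) if strict else (lines_ok or hits_ok)


def filter_first_rule(entries, strict=True):
    """Build the candidate set from entries shaped like
    {"code_lines": int, "hits": int, ...} (hypothetical layout)."""
    return [e for e in entries
            if passes_first_rule(e["code_lines"], e["hits"], strict)]
```

For instance, an entry with 10 code lines and 2 hits is rejected under the strict criterion but kept under the loose one.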

The second preset length, the preset number, the predetermined number of times, and the predetermined number are all set manually, and the second preset length is smaller than the first preset length. In this embodiment, the second preset length is 4000, the preset number is 50, the predetermined number of times is 4, and the predetermined number is 5.

When, for a result set, the number of fourth fixed-length codes exceeds the preset number and the total length of the fourth fixed-length codes in the result set exceeds the second preset length, the result set contains a large number of fourth fixed-length codes, and filtering can therefore be performed by skipping.

If the result set does not meet both conditions, the fourth fixed-length codes in the result set are not numerous, and no skipping is needed.

The skip is executed only when the hit number of a fourth fixed-length code exceeds the predetermined number. A fourth fixed-length code whose hit number exceeds the predetermined number is highly representative of the corresponding existing code segment. Even if the adjacent subsequent fourth fixed-length codes also have high hit numbers, removing them has little effect on the result, so they are deleted to reduce the number of fourth fixed-length codes.
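One possible reading of the skip-based second filtering rule is sketched below in Python. Several points are assumptions of this sketch rather than statements of the description: the total length is passed in as a precomputed value, the hit threshold is taken to be 4 and the number of adjacent subsequent codes deleted after a high-hit code is taken to be 5, and the entry layout and all names are hypothetical.

```python
SECOND_PRESET_LENGTH = 4000  # second preset length in this embodiment
PRESET_NUMBER = 50           # threshold on the number of codes in a result set
HIT_THRESHOLD = 4            # assumed meaning of "predetermined number of times"
SKIP_COUNT = 5               # assumed meaning of "predetermined number"


def skip_filter(entries, total_length,
                preset_number=PRESET_NUMBER,
                second_preset_length=SECOND_PRESET_LENGTH,
                hit_threshold=HIT_THRESHOLD,
                skip_count=SKIP_COUNT):
    """Filter one result set by skipping (illustrative sketch).

    entries: fourth fixed-length codes of one result set, ordered by end
             code line number, each shaped like {"hits": int, ...}.
    """
    # Skipping applies only when both size conditions hold.
    if not (len(entries) > preset_number
            and total_length > second_preset_length):
        return list(entries)

    kept = []
    i = 0
    while i < len(entries):
        kept.append(entries[i])
        if entries[i]["hits"] > hit_threshold:
            # A high-hit code represents its neighbourhood: drop the
            # next skip_count adjacent codes.
            i += 1 + skip_count
        else:
            i += 1
    return kept
```

If the embodiment's values are assigned the other way round, only the two default arguments need to be swapped.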

In addition, to show the priority of the dependency between the current code segment and each existing code, the matched existing codes are sorted after the corresponding existing codes have been determined through the fourth fixed-length codes. The sorting is based on at least one of the number of hits, the number of hit lines, the storage path, and the like.

In one embodiment, the number of hit lines and the number of hits are weighted to obtain a weighted value.

The existing codes are sorted in descending order of weighted value. If the weighted values are the same, they are sorted by the length of the storage path, with existing codes having longer storage paths placed after those having shorter storage paths. If the storage path lengths are also the same, they are sorted by the number of lines in a single hit, with existing codes having more lines in a single hit placed before those having fewer.

The weighted value is calculated as (number of hit lines × k1) + (number of hits × k2); in this embodiment, k1 = k2 = 1.
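The three-level ordering can be illustrated with a short Python sketch. The field names `hit_lines`, `hits`, `path`, and `single_hit_lines` are hypothetical, and measuring the storage path length in characters is an assumption of this sketch; the description does not say how the length is measured.

```python
def weighted_value(hit_lines, hits, k1=1, k2=1):
    """Weighted value = hit_lines * k1 + hits * k2 (k1 = k2 = 1 here)."""
    return hit_lines * k1 + hits * k2


def rank_existing_codes(matches, k1=1, k2=1):
    """Sort matched existing codes by priority (illustrative sketch).

    matches: list of dicts shaped like
             {"path": str, "hit_lines": int, "hits": int,
              "single_hit_lines": int} (hypothetical layout).
    """
    return sorted(
        matches,
        key=lambda m: (
            -weighted_value(m["hit_lines"], m["hits"], k1, k2),  # larger first
            len(m["path"]),            # shorter storage path first
            -m["single_hit_lines"],    # larger single hit first
        ),
    )
```

Negating the descending keys lets a single `sorted` call express all three tie-break levels in one key tuple.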

The above embodiments are not intended to limit the scope of protection of the present application. Accordingly, all equivalent changes made in accordance with the structure, shape, and principle of the application shall fall within the scope of protection of the application.

Claims

1. A method for generating fingerprint features of a code segment, comprising the steps of:

Acquiring source codes of the code fragments;

obtaining fixed-length codes of each character string segment, and obtaining a plurality of second fixed-length codes corresponding to the plurality of character string segments one by one, wherein the fixed-length codes are obtained by a variational autoencoder;

Taking the end code line number of the third fixed-length codes as the fingerprint characteristics of the code segment;

acquiring a first fixed-length code of the source code;

taking the first fixed-length code, the plurality of third fixed-length codes and the ending code line number of each third fixed-length code as fingerprint characteristics of the code segment;

Screening the third fixed-length codes from each fixed-length code set to obtain a plurality of third fixed-length codes, comprising the following steps:

judging whether the adjacent third fixed-length codes are the same,

acquiring a storage path of a source code of a code segment;

2. A fingerprint feature matching method for a code segment, comprising the steps of:

extracting fingerprint features of a code segment by the method of claim 1, and matching similar code segments for the code segment based on the fingerprint features.

3. A method of fingerprint feature matching of code segments according to claim 2, characterized in that matching similar code segments for the code segments based on fingerprint features comprises the steps of:

Establishing a result set corresponding to the existing code;

4. A method of fingerprint feature matching for a code segment according to claim 3, wherein determining whether a corresponding fourth fixed length code is stored in a corresponding result set, further comprises the steps of:

5. The fingerprint feature matching method of a code segment according to claim 4, wherein determining similar code segments based on a fourth fixed-length code in the result set comprises the steps of:

6. The fingerprint feature matching method of a code segment according to claim 5, wherein the fourth fixed-length code has a code line number of a corresponding character string segment;

7. The method of claim 5, wherein the fourth fixed-length code has an ending code line number corresponding to the character string segment,