CN119045880B

CN119045880B - Code positioning method based on programming language migration

Info

Publication number: CN119045880B
Application number: CN202411536704.0A
Authority: CN
Inventors: 胡明光; 郭辉; 裴高翔; 沈铖涛; 董明; 赵鲁; 姚拓中; 叶宏武; 万旭成
Original assignee: Zhejiang Kingnet Chengdu Westone Information Industry Inc
Current assignee: Zhejiang Kingnet Chengdu Westone Information Industry Inc
Priority date: 2024-10-31
Filing date: 2024-10-31
Publication date: 2025-05-06
Anticipated expiration: 2044-10-31
Also published as: CN119045880A

Abstract

The invention discloses a code positioning method based on programming language migration, which relates to computer software, wherein functions are thinned into typical processes to construct a function library and a total keyword table, function labels are generated through quasi transformation, a subject word self-checking crawler system is designed, a large number of accurate main and auxiliary language code segments are obtained based on a subject correlation discriminant method and a hash matching method to construct a main and auxiliary code library, a countermeasure generator is pre-trained based on a countermeasure training strategy to realize main and auxiliary language characteristic alignment, a contrast learning micro-tuning large model encoder is combined with the main code library to realize the function label assignment of code segments, a to-be-positioned word and to-be-positioned code are obtained, the to-be-positioned code is split, the to-be-positioned word is identified based on a dual-channel query strategy to select a positive example, and the positive example and the code segments are input into a pre-training and micro-tuning network to realize the code positioning of a designated function or a designated typical process.

Description

Code positioning method based on programming language migration

Technical Field

The invention relates to computer software, in particular to a code positioning method based on programming language migration.

Background

In developing new required functions or maintaining existing code, a developer may attempt to acquire code fragments corresponding to a specific subject function and learn about their logical structure. When analysis work on the code hierarchy is involved, complex source code structures and undesirable code styles for the developer will result in the developer spending a significant amount of time locating code fragments and clearing the implementation logic. Therefore, finding a method that can quickly locate code fragments has important practical implications for developers.

The prior patent of the invention with the publication number of CN109240700B provides a key code positioning method and a system, which collect the function call relation from the entry function of the interface parameter constraint code under the scene of the preset input parameter by a program instrumentation mode, and analyze the key code of each function according to the function call relation so as to position all the constraint codes related to the interface parameter.

In the prior art, the positioning of key codes is often only aimed at a single programming language, the positioning of codes in multiple programming languages cannot be realized, the positioning of the codes is realized by inputting preset parameters for key code analysis, and related codes of a designated function or a designated typical process cannot be positioned rapidly.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a code positioning method based on programming language migration, so as to realize code positioning of a designated function or a designated typical process of a cross programming language.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A code positioning method based on programming language migration comprises the following specific steps:

Construction of function library And a total keyword table;

Retrieval of subject language code fragments by a subject term self-checking crawler system with the addition of functional tags to build a subject code library;

Combining a subject word self-checking crawler system and sample enhancement to cooperatively acquire auxiliary language code segments to construct auxiliary code library;

Constructing a countermeasure generator and pre-training based on a countermeasure training strategy;

adopting contrast learning to finely tune a large model encoder, and further pre-training a Softmax regression classifier;

And acquiring the word to be positioned, splitting the code to be positioned based on an artificial intelligence or static code analysis tool, selecting a positive example based on a two-channel query strategy, and positioning the code.

Further, building a function libraryAnd a total keyword tableThe method comprises the following specific steps:

counting all functions which can be realized by the programming language;

acquiring all typical processes corresponding to a single function;

extracting keywords corresponding to a single typical process to obtain all keywords corresponding to a single function, and constructing a keyword list of the single function;

Sequentially storing single functions and corresponding keyword tables in a function library Until all single functions are stored, function libraryThe construction is completed;

Extracting function library Obtaining non-duplicate keywords by de-duplication and rearranging the non-duplicate keywords to construct a total keyword table。

Further, a main code base is constructedThe method comprises the following specific steps:

determining a main language and traversing the function library in sequence The single function and the main language are used as the main topic word delivery main topic word self-checking crawler system to construct a code set corresponding to the single function;

Acquiring function library Keyword list corresponding to single function in the list, and comparing the total keyword listTo generate a function code corresponding to a single function;

Converting the function codes corresponding to the single function into function labels through reversible transformation, and endowing the function labels with code sets corresponding to the single function;

sequentially storing code sets and assigned function labels until all code sets of single functions are stored, and a main code library And (5) finishing construction.

Further, constructing a secondary code baseThe method comprises the following specific steps:

determining a secondary language, and delivering the secondary language as a subject word to a subject word self-checking crawler system until retrieval is stopped;

Judging whether the number of the code segments of the auxiliary language reaches a total auxiliary threshold value, if so, directly constructing an auxiliary code library ;

If not, generating a secondary language code segment through sample enhancement supplement until reaching a total secondary threshold value, and constructing a secondary code library。

Further, the subject term self-checking crawler system comprises a scheduling module, a downloading module, an extracting module and a data pipeline module;

The scheduling module receives and judges whether the subject word is crawled, generates a crawler request based on the subject word which is not crawled, and decides to send the crawler request to the downloading module or a task queue inserted into Redis based on whether the crawling task is being executed;

The downloading module adopts random dynamic User-Agent and random time delay to disguise the crawler request, accesses the website and downloads the crawler response;

The extraction module introduces a webpage grabbing algorithm, extracts page elements, calculates a score based on text density, generates a text by standardizing the page elements with the highest score, and extracts a title, content and URL in the text by adopting find () and find_all () commands;

The data pipeline module adopts a topic relativity discriminant method to screen out a related text from the text, eliminates repeated related text by a hash matching deduplication method, and recognizes characters in the content by an OCR technology to automatically generate code segments.

Further, the topic relevance determining method comprises the following specific steps:

Acquiring all the thematic words and sequentially converting the thematic words into a thematic word vector;

Extracting a plurality of keywords by adopting a TextRank algorithm, and sequentially converting the keywords into keyword vectors;

The cosine distances of the single subject word vector and the single keyword vector are calculated, and the average value of all the cosine distances is used as the subject relativity;

And setting a correlation threshold, and judging the text as the correlation if the correlation of the subject is greater than or equal to the correlation threshold.

Furthermore, the hash matching deduplication method for eliminating repeated relevant texts comprises the following specific steps:

analyzing the new URL through URLparse () command, and generating a core segment by splicing the protocol and the domain name;

setting a fixed-length bit array, and compressing a core segment into a fingerprint sequence through an SHA1 algorithm;

Inputting the fingerprint sequence into a bloom filter to generate a plurality of hash values, and respectively modulo the length of the bit array to generate an index key;

it is decided whether the new URL is discarded or stored based on whether the index key exists in the dis.

Further, the construction and pre-training of the countermeasure generator includes the following specific steps:

From a master code library And a secondary code libraryRandomly extracting the same number of main language code segments and auxiliary language code segments respectively, and generating visual features through a BERT network to serve as input of a countermeasure generator;

inputting visual features into a confusion generation network, extracting potential features through multiple convolutions, and generating unified features through linear modulation and ReLU function mapping;

Inputting the unified features into a feature restoration network, expanding dimensions through deconvolution, and restoring through nonlinear transformation to obtain reconstructed visual features;

Inputting the unified features into a domain discrimination network, mapping the unified features to a separable space through nonlinear transformation, and judging that the unified features come from a main language or a secondary language through logistic regression;

Considering characteristic reduction loss and field discrimination loss, introducing balance parameters, and defining a loss function ;

The visual characteristics are repeatedly input into the countermeasure generator, and the loss function is updated based on the countermeasure training strategyUp to a loss functionAt a minimum, the primary language code segment and the secondary language code segment achieve feature alignment.

Furthermore, the countermeasure training strategy refers to forward updating of the confusion generating network and the feature restoring network, and the domain judging network executes bidirectional updating;

forward updating means that the hyper-parameters in the confusion generation network and the feature reduction network are updated along the gradient direction of the feature reduction loss reduction;

the bidirectional updating means that the super-parameters of the domain discrimination network are updated towards the gradient direction of the domain discrimination loss reduction, and the super-parameters in the confusion generation network and the feature reduction network are informed to be updated according to the gradient direction of the domain discrimination loss increase;

In the training process, the confusion generating network and the characteristic restoring network can continuously fight against the domain discrimination network to improve the self-capacity, and when the fight generator reaches Cheng Nashen balance, the function is lost Minimum.

Further, the method for fine tuning the large model encoder by adopting contrast learning comprises the following specific steps:

Selecting GraphCodeBERT a pre-training model as a large model encoder;

From a master code library Randomly extracting two main language code segments from all code sets of the code set respectively and extracting corresponding function labels;

Inputting all extracted main language code segments into a countermeasure generator to obtain unified features, and further inputting the unified features into a large model encoder to generate code characterization;

defining a loss function And optimizing code representation by contrast learning for multiple rounds to continuously reduce loss functionWhen the loss functionAnd when the code representation is minimum, the distinguishability of the code representation is maximum, and the super-parameters of the large model encoder are fixed.

Further, the pre-training Softmax regression classifier comprises the following specific steps:

setting a Softmax regression model, reducing dimension of the code representation through a full connection layer, and mapping the code representation to the Softmax activation function Within the interval, outputting a prediction probability vector;

setting multiple classification loss functions To represent the degree of similarity of the predicted probability vector to the actual probability vector distribution;

Code representation multi-round input Softmax regression classifier, and gradient descent method is adopted to update super parameters of full connection layer to make multi-classification loss function At a minimum, the hyper-parameters of the Softmax regression classifier were fixed.

Further, the code positioning comprises the following specific steps:

identifying words to be located based on two-channel query strategy and using the same from main code library Selecting a positive example;

sequentially inputting the positive examples and a plurality of code segments obtained by splitting the code to be positioned into a countermeasure generator to obtain the unified characteristics of all the single code segments;

The unified features of all the single code segments are further input into a large model coder, the output code representations are sequentially input into a Softmax regression classifier, and the prediction probability vectors corresponding to the single code segments are output;

And selecting the function label with the highest probability from the prediction probability vector to assign the single code segment, and acquiring all the single code segments consistent with the function label of the positive example.

Further, selecting a positive example based on a dual-channel query strategy comprises the following specific steps:

identifying words to be positioned, and judging that the words are designated functions or typical processes;

if the function is designated, querying the function library Acquiring a keyword list corresponding to a specified function;

comparing the function codes with the total keyword list, obtaining function codes corresponding to the designated functions and converting the function codes into function labels;

Querying a master code library Randomly selecting a main language code segment with the same function label as a positive example;

If the typical process is designated, acquiring keywords corresponding to the typical process;

Determining the position of the keywords in the total keyword list, generating a reference function code and converting the reference function code into a reference function label;

acquiring all the function labels which are larger than or equal to the reference function label, and restoring the function labels into corresponding function codes;

comparing the function codes with the reference function codes, and placing function labels corresponding to the function codes containing the appointed typical process into a label set to be positioned;

Selecting a functional tag from the set of tags to be located, and selecting a functional tag from the master code library And randomly selecting a main language code segment with the same function label as a positive example to perform code positioning until all the function labels in the label set to be positioned are selected.

Compared with the prior art, the invention has the remarkable advantages that:

1. the method comprises the steps of refining functions to typical processes to construct a keyword table, comparing the total keyword table to generate a function code, generating a function label through reversible transformation, and providing two code positioning modes of appointed function positioning and appointed typical process positioning by assisting a double-channel query strategy;

2. Designing a subject term self-checking crawler system, combining with Redis to realize quick data storage and distributed access, introducing a webpage grabbing algorithm to unify extraction rules, designing a subject relevance judging method and a hash matching duplication removing method to filter irrelevant webpages and repeated webpages, improving acquisition efficiency and acquisition precision, adopting sample enhancement to solve the problem that the subject term self-checking crawler system is insufficient in crawling magnitude of low-resource sub-language code segments, and ensuring that the magnitudes of a main code library and a sub-code library can effectively support pre-training and fine adjustment of a neural network;

3. The countermeasure generator is designed, the training strategy is based on the countermeasure, the characteristic alignment of the main and auxiliary language code segments is realized, and the large model encoder is finely tuned through contrast learning and is used for assisting the Softmax regression classifier to endow the code segments with accurate functional labels.

Drawings

FIG. 1 is a flow chart of a code location method based on programming language migration;

FIG. 2 is a flow chart of generating a function tag corresponding to a single function in the present invention;

FIG. 3 is a flow chart of the related text of the hash matching deduplication method for eliminating duplicates in the invention;

FIG. 4 is a diagram of a model of a countermeasure generator in accordance with the present invention;

FIG. 5 is a code positioning flow chart of the present invention;

FIG. 6 is a flow chart of a dual channel query strategy in the present invention.

Detailed Description

The invention will now be described in further detail with reference to the drawings and examples.

As shown in FIG. 1, the embodiment of the invention is a code positioning method based on programming language migration, which comprises the following specific steps:

Construction of function library And a total keyword table;

Constructing a countermeasure generator, and pre-training the countermeasure generator by adopting a countermeasure training strategy to realize alignment of main and auxiliary language features;

Adopting a contrast learning micro-tuning large model encoder to maximize the distinguishing property of the code representation, and further inputting the code representation into a Softmax regression classifier for pre-training;

and acquiring the word to be positioned and the code to be positioned, splitting the code to be positioned, identifying the word to be positioned selection positive example based on a double-channel query strategy, and positioning the code.

all functions that can be implemented by the statistical programming language, assuming commonality A function, denoted as;

Acquisition of the firstPersonal functionAll typical processes corresponding to this, assume a commonThe exemplary process includes description, looping, comparing, ordering, and computing;

extracting key words corresponding to each typical process in turn due to functions Corresponding toTypical procedure, therefore, is functionalCorresponding toKeywords, combinationsIndividual keywords to build functionsCorresponding keyword table, wherein,Is a function ofCorresponding firstA keyword;

Will be the first Personal functionAnd corresponding keyword tableSynchronous recording in function libraryIn the method, after all functions and corresponding keyword tables are recorded completely, a function library is obtainedThe construction is completed;

Extracting function library All keywords in the list are subjected to de-duplication combination, and the rest is assumed after de-duplicationNon-repeated keywords are rearranged according to dictionary sequence to construct a total keyword list。

Determining a main language to distinguish a secondary language, wherein the main language refers to a single programming language with a large amount of public resources and normalized data, the secondary language refers to all programming languages except the selected main language, and C++ is selected as the main language in the embodiment;

For function libraries Middle (f)Personal functionWill functionAnd the main language is used as the subject term, and the self-checking crawler system continuous searching function based on the subject termCorresponding main language code segment up to the functionThe number of corresponding main language code segments reaches a single main thresholdStopping at the time, and turning on the functionCorresponding toPackaging individual main language code segments to construct code sets;

As shown in fig. 2, the initialization generationBit all-zero code and function library inquiryTo obtain the function ofCorresponding keyword tableComparing the total keyword listIf the total keyword listMiddle (f)Individual keywordsIs located in the keyword tableIn (3), the first to all zero codesThe bit is set to 1, thereby generating a functionCorresponding functional codeAnd will functionCorresponding functional codeConversion to functional labels by reversible transformationAnd giving code setsThe original value and the converted value of the reversible transformation must be in one-to-one correspondence, and the situation of one-to-many or many-to-one cannot occur, namely the original value is unique, the converted value is unique, and the function is directly coded in the embodimentRegarded as binary codes, and then converted into decimal numbers as function labels through binary conversion;

Code setSum function labelSynchronous storage in main code baseIn the process, search function through self-checking crawler system of subject termCorresponding toIndividual subject language code segments to construct a code setAdditional function labelAnd synchronously stored in a main code baseIn (a) and (b);

When function library In (a)Code sets and function labels corresponding to the functions are stored in a main code baseIn the middle, the master code libraryAnd (5) finishing construction.

determining required secondary languages, and selecting Python and Rust as the secondary languages in the embodiment;

Because the secondary language may have insufficient resources on the network and the normative of writing and annotating is low, it is difficult to retrieve the secondary language code segments of the specified function and attach function labels, only the collection number is considered to be equal to the total secondary threshold The selected auxiliary language is directly used as the subject word without distinguishing functions, and the auxiliary language code segment is continuously searched based on the subject word self-checking crawler system until the search is stopped;

if the number of the sub-language code segments is smaller than the total sub-threshold Judging that the source of the auxiliary language public code for stopping is insufficient, and generating auxiliary language code segments through sample enhancement and supplement until reaching the total auxiliary threshold valueOtherwise, judging the stopping reason as that the search task is completed;

Will be Direct storage of individual sub-language code segments to build sub-code libraries。

Specifically, the sample enhancement supplemental generation of the secondary language code segment includes the following several ways:

The code back-translation mode is adopted, namely, a multi-language generation model trained on large-scale code data is adopted, the multi-language generation model comprises CodeGeeX capable of realizing zero sample code translation, a secondary language code segment is translated into code segments of other programming languages, the secondary language code segment is translated back again, the function of the code segment can be kept unchanged by the secondary language code segment after back translation, but the specific code display form is different from that of the secondary language code segment before back translation;

The variable name replacement mode, namely renaming the variable names and function names in the auxiliary language code segment, ensures that the presentation form of the auxiliary language code segment is changed on the premise of not changing functions, and the specific implementation mode comprises naming standardization and naming randomization, wherein the naming standardization comprises the steps of sequentially naming the newly-appearing variable names or function names in the auxiliary language code segment according to the appearance sequence The name randomization refers to randomly selecting an unselected name from a pre-established name pool for replacing a newly-appearing variable name or function name in the secondary language code segment.

Furthermore, the subject term self-checking crawler system is improved based on Scrapy crawler frames and comprises a scheduling module, a downloading module, an extracting module and a data pipeline module;

The scheduling module receives the subject words and judges whether the subject words are crawled, discards the crawled subject words and generates a crawler request based on the subject words which are not crawled, further judges whether a crawling task is being executed, if no crawling task is being executed, the crawler request is sent to the downloading module, otherwise, the crawler request is inserted into a Redis task queue, the traditional Scrapy crawler framework stores the crawler request in a disk space, the disk space cannot be shared by multiple hosts, the Redis can share the crawler request, the collaborative crawling of multiple hosts is realized, and the crawler efficiency is improved;

The downloading module adopts a random dynamic User-Agent and random time delay to disguise a crawler request so as to improve the success rate of a crawler system to access a website and download crawler responses, the User-Agent is used for informing the website of the properties such as an operating system, a browser and the like of a client, if the crawler request always adopts a fixed User-Agent, the User-Agent is easy to identify and inhibit the crawler by the website, the random dynamic User-Agent randomly selects the User-Agent by pre-configuring an identity list for storing common User-Agent information during single crawling, in addition, the default access interval in a Scrapy crawler frame is 2 seconds, the interval of a common client to access the website is random, the random time delay is generated by creating a class named as random time delay, and the crawler system simulates the random behavior of the client to access the website according to the random interval;

the extracting module introduces the existing webpage capturing algorithm into the Spider module in Scrapy crawler frames, acquires and optimizes crawler responses to generate texts, adopts find () and find_all () commands to extract titles, contents and URLs in the texts, and because the general quantity and distribution of page elements of different website webpages are different, the page elements comprise navigation links, headers, footers, advertisement popups and texts, and one-to-one adaptive extracting rules are required to be formulated when the Scrapy crawler frames face different websites, and the introduced webpage capturing algorithm extracts page elements by constructing a DOM tree, calculates scores based on text density of the page elements, and performs standardized processing on the page elements with highest scores to generate texts, so that the extracting rules of different websites are effectively unified;

The data pipeline module adopts a topic relativity discrimination method to screen out a related text from the acquired text, and eliminates repeated related text through a hash matching deduplication method, wherein the content in the related text possibly comprises a code segment presented in a text form and a code segment presented in a picture form, and characters in the content are identified through an OCR technology to automatically generate the code segment.

acquiring all the subject words, and sequentially converting the single subject word into a high-dimensional subject word vector through a Skip-gram model;

extracting a plurality of keywords from the title and the content of the text by adopting a TextRank algorithm, keeping the number of the extracted keywords consistent with the number of the subject words, and sequentially converting a single keyword into a high-dimensional keyword vector by using a Skip-gram model;

sequentially calculating cosine distances of a single subject word vector and a single keyword vector, and taking the average value of all cosine distances as the subject correlation degree of the subject word and the text;

As shown in fig. 3, further, the hash matching deduplication method for eliminating duplicate related text includes the following specific steps:

analyzing a new URL through a URLparse () command carried by the Scrapy crawler framework, acquiring a protocol and a domain name in the new URL, and splicing to generate a core segment;

Setting up A dimension bit array, setting the values of all bits to 0, and compressing the core segment into a fingerprint sequence through an SHA1 algorithm;

Inputting the fingerprint sequence into a bloom filter, the bloom filter passing through The independent hash functions calculate fingerprint sequences to generate corresponding fingerprint sequencesHash values;

Will be The hash values respectively align the array lengthsTaking a module, finding out the corresponding position in the bit array and setting the value as 1 to generate an index key;

Based on the quick inquiry of the index key in the Redis, judging whether the index key exists, if so, discarding the new URL, otherwise, storing the index key corresponding to the new URL in the Redis.

Still further, the advantages of the hash-matching deduplication method include:

The traditional deduplication scheme directly adopts a complete URL for detection, however, the complete URL consists of a protocol, a domain name, a port number, a path, query parameters and an anchor point, wherein the protocol and the domain name determine to access a webpage, and the rest only confirm the port number, a specific file, customized content and a specific position of the accessed webpage, so that a plurality of different URLs can be indexed to the same webpage, namely different homology, a hash matching deduplication method is used for pre-analyzing the URL to obtain a core segment, and the core segment is used for judging to solve repeated crawling caused by different homology;

The URL is detected by the traditional scheme through comparing with the URL stored in the database, however, the URL comprises numbers, characters, symbols and even messy codes, the length is long, the structure is complex, the calculation force required by traversing the database is high when the database is compared, the URL dimension is reduced by the Hash matching deduplication method through the SHA1 algorithm, and the complex URL comparison is converted into simple binary bit comparison based on the property of a bloom filter, so that the efficiency is greatly improved.

As shown in fig. 4, further, constructing and pre-training a countermeasure generator to achieve primary and secondary language feature alignment includes the following specific steps:

From a master code library Random decimation inIndividual main language code segmentsFrom the secondary code libraryRandom decimation inIndividual sub-language code segmentsAnd concatenate to generate an input code segment;

Will input code segmentsGenerating visual features through BERT network processingAs the input of the countermeasure generator, the visual characteristics refer to the lower characteristics which can be directly found through the observation means, and the visual characteristics have larger difference due to the difference of the main language and the auxiliary language;

will be intuitive features Input confusion generation networkExtracting potential features through multiple convolution upper positions, mapping the potential features to a unified space through linear modulation and a ReLU function, and generating unified featuresUnified features refer to the essential features of the code segments, which are high-similarity features shared by the main language and the auxiliary language;

will unify features Input feature restoration networkThe unified feature dimension is expanded through multiple deconvolution, and the space where the visual features are located is restored through linear modulation and ReLU functions, so that the reconstructed visual features are obtained, and a feature restoration network is obtainedAims to prevent confusion generation networkExcessive upper extraction leads to complete loss of visual features;

will unify features Input field discrimination networkThe unified feature is further mapped to a separable space through nonlinear transformation formed by linear modulation and nonlinear activation functions, and the unified feature is judged to come from a main language or a secondary language through logistic regression;

defining a loss function The specific formula is as follows:

,

Wherein, 、AndRespectively representing confusion generating networksFeature restoration networkSum field discrimination networkThe super-parameters involved in the process,In order to balance the parameters of the device,Is a domain label for indicating the firstPersonal unifying featuresFrom either the primary language or the secondary language,AndThe characteristic reduction loss and the domain discrimination loss are respectively, and MSE and cross entropy are respectively adopted in the embodiment;

will be intuitive features Repeatedly input countermeasure generator, generating network via confusionFeature restoration networkSum field discrimination networkAfter processing, the loss function is updated based on the countermeasure training strategyWhen the loss functionMinimum time, domain discrimination networkThe unified features cannot be distinguished from the main language or the auxiliary language, but the feature reduction networkThe visual characteristics can be reconstructed based on the unified characteristics, the super parameters of the countermeasure generator are fixed, and the main language code segment and the auxiliary language code segment are aligned in characteristics, namely, in the characteristic space where the unified characteristics are located, the main language code segment and the auxiliary language code segment can be regarded as code segments of the same language.

Further, the countermeasure training strategy refers to a confusion generating networkAnd feature recovery networkIn training process and domain discrimination networkAgainst each other, in particular, confusion generating networksAnd feature recovery networkForward updating and domain discrimination networkPerforming a bi-directional update;

forward updating means that, when After one round of training by the visual characteristic input countermeasure generator, the characteristic reduction loss is obtainedDuring the next training round, the confusion generating networkAnd feature recovery networkSuper parameters in (a)AndAll will be lost along the characteristic reductionUpdating the reduced gradient direction;

bidirectional updating means that when After one round of training is carried out on the visual characteristic input countermeasure generator, the field discrimination loss is obtainedDuring the next training, the domain discrimination networkSuper parameter of (2)Loss of discrimination toward the fieldThe reduced gradient direction is updated by multiplying by the balance parameterTo inform the confusion generating network of the opposite numbers of (a)And feature recovery networkSuper parameters in (a)AndAll according to the field discrimination lossUpdating the increased gradient direction;

confusion-generating networks under the impetus of countertraining strategies And feature recovery networkWill continue and the field will distinguish the networkCountering, confusion generating networkDiscriminating networks for interference domainDistinguishing the main language from the auxiliary language can improve the capability of generating unified features in the upper part of the user, and at the same time, the features restore the networkThe capability of reconstructing visual features can be improved to ensure that unified features are not too high-level, opposite, domain discrimination networkTo distinguish unified features from either the main or sub language, discrimination is also enhanced, and the final countermeasure generator reaches Cheng Nashen equalization, at which time the penalty functionMinimum.

the pre-training model GraphCodeBERT is selected as a large model encoder, and the main language code segment or the auxiliary language code segment can be regarded as the same language code segment after being processed by the countermeasure generator, so that the fine-tuning large model encoder only needs the main language code segment with the existing function label;

From a master code library A kind of electronic devicePersonal code setEach randomly extracts two main language code segments to the firstPersonal code setFor example, two main language code segments randomly extracted are respectively marked asAndAnd extracting the corresponding function labelCo-acquisition ofIndividual main language code segmentsAnd corresponding functional labels;

Will be the subject language code segmentInput countermeasure generator for obtaining unified characteristicsFurther unify featuresInputting large model encoder, generating code representationThe code characterization dimensions are all;

Defining a loss functionThe specific formula is as follows:

,

Wherein, Is used for the temperature super-parameter,Characterizing the ith codeAnd (d)Personal code characterizationCosine similarity of (C), loss functionThe relativity between the code segments of the same main language of the function label and the relativity between the code segments of different main languages of the function label are described;

Multiple round optimization through contrast learning Personal code characterizationTo continuously reduce the loss functionAfter each round of comparison learning, the correlation degree between the main language code segments with the same function label and the difference degree between the main language code segments with different function labels are increased at the same time, when the function is lostAnd at the minimum, the cosine similarity between the code characterizations corresponding to the same main language code segments of the function labels can be maximized, the cosine similarity between the code characterizations corresponding to different main language code segments of the function labels can be minimized, the difficulty of directly distinguishing different function labels based on the code characterizations is reduced to the minimum, the distinguishing property of the code characterizations is maximized, and the super parameters of the large model encoder are fixed.

Setting a Softmax regression model, wherein the feature dimension of the known code is as follows Functional label sharingThe input layer and the output layer of the Softmax regression model are respectively arrangedSum of allThe neuron is hidden layer is full connection layer, and Softmax activation function is set after output layer for regression classification, when the first layerPersonal code characterizationInputting into Softmax regression model, and reducing dimension by linear modulation of full connection layerMapping to by Softmax activation functionWithin the interval, the output of the Softmax regression model is the predictive probability vectorPredictive probability vectorThe dimension isWherein, the firstIndividual elementsRepresent the firstPersonal code characterizationFunctional labelIs used for predicting the probability of (1);

setting multiple classification loss functions To represent predictive probability vectorsAnd actual probability vectorsThe similarity of the distribution is as follows:

,

Wherein, Representing actual probability vectorsIs assumed to be the transpose of (1)Personal code characterizationFunctional labels should be givenActual probability vectorMiddle and remove the firstIndividual elementsIs 0 except 1, thus, the loss functionCan be further simplified into;

Characterizing codesMultiple rounds of input Softmax regression classifier, gradient descent method is adopted to update super parameters of full connection layer, so that multiple classification loss functions are realizedContinuously decreasing, multi-class loss functionThe closer to 0, indicate thePersonal code characterizationFunctional labelThe more the predictive probability of (2) approaches 1, the stronger the classification ability of the softmax regression classifier, as a multi-classification loss functionAt minimum, the hyper-parameters of the Softmax regression classifier were fixed.

Furthermore, the code to be positioned is a code source file provided by a user, and can be written by a main language or an auxiliary language, the code source file comprises a plurality of code segments, splitting the code to be positioned can be performed through an artificial intelligence or static code analysis tool, understand is adopted for analysis in the embodiment, a programming language is automatically identified after the code to be positioned is imported, and a hierarchical structure of the code to be positioned is displayed through a visual interface, at the moment, the code to be positioned can be split into a plurality of code segments according to the hierarchical structure through an automatic script or manual operation.

As shown in fig. 5, further, the code positioning includes the following specific steps:

identifying words to be located as specified functions or specified typical processes based on dual-channel query strategy and identifying words to be located from a main code library The positive examples refer to main language code segments of which the function labels are matched with specified functions or specified typical processes;

inputting the positive example and a plurality of code segments obtained by splitting the code to be positioned into a countermeasure generator together to obtain the unified characteristics of all the single code segments;

further inputting unified features of all single code segments into a large model encoder to obtain code characterizations of all single code segments;

The code representation of all the single code segments is sequentially input into a Softmax regression classifier, a prediction probability vector corresponding to the single code segment is output, a function label with the maximum probability is selected from the prediction probability vector to be endowed to the single code segment, and all the single code segments consistent with the function label of the positive example are the code segments with required specified functions or specified typical processes.

As shown in fig. 6, further, selecting a positive example based on a two-channel query strategy includes the following specific steps:

Further comparing with the total keyword list, obtaining the function code corresponding to the designated function and converting the function code into a function label;

Further determining the position of the keywords in a total keyword list, generating a reference function code, wherein the reference function code and the function code are in the same dimension, the position corresponding to the keywords is 1, the rest is 0, and the reference function code is regarded as a binary code and is converted into a reference function label;

Acquiring all function labels which are greater than or equal to a reference function label, restoring the function labels to corresponding function codes and comparing the function labels with the reference function codes, if the function codes and the reference function codes have 1 at the same position, placing the function labels corresponding to the function codes into a label set to be positioned, and assuming that the reference function codes are 0100, the corresponding reference function labels are 4, wherein the function codes containing a designated typical process are necessarily 1 at the same position as the reference function codes, but the rest positions can also be 1, for example, the function codes are 0111, and the corresponding function labels are 7 and are necessarily greater than or equal to the reference function labels;

selecting a functional tag from the set of tags to be located, and selecting a functional tag from the master code library And randomly selecting a main language code segment with the same function label as a positive example to perform code positioning, removing the label after the code positioning is completed, continuously selecting a new function label from the label set to be positioned, and repeating the steps until all the function labels in the label set to be positioned are selected.

The invention discloses a code positioning method based on programming language migration, which refines functions into typical processes to construct a function library and a total keyword list, generates function labels through quasi-transformation, designs a subject term self-checking crawler system, acquires a large number of accurate main and auxiliary language code segments based on a subject correlation discrimination method and a hash matching method to construct a main and auxiliary code library, pre-trains a countermeasure generator based on a countermeasure training strategy to realize main and auxiliary language characteristic alignment, adopts a contrast learning micro-tuning large model encoder to maximize the code representation separability in combination with the main code library, assists a pre-training Softmax regression classifier to realize the function label assignment of the code segments, acquires to-be-positioned words and to-be-positioned codes, splits to-be-positioned codes and identifies to-be-positioned words based on a dual-channel inquiry strategy to select positive examples, and inputs the positive examples and the code segments into a pre-training and micro-tuning network to realize the code positioning of specified functions or specified typical processes.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. The code positioning method based on programming language migration is characterized by comprising the following specific steps:

constructing a function library and a total keyword list based on all typical processes corresponding to a single function;

Acquiring a main language code segment based on a subject term self-checking crawler system and attaching a function label to construct a main code library;

A sample enhanced auxiliary subject word self-checking crawler system is adopted to acquire a secondary language code segment so as to construct a secondary code library;

Constructing and pre-training a countermeasure generator based on a countermeasure training strategy, adopting a contrast learning fine tuning large model encoder to maximize the distinguishing property of the code representation, and further inputting the code representation into a Softmax regression classifier for pre-training;

acquiring words to be positioned, splitting codes to be positioned, and selecting a positive example to position the codes based on a double-channel inquiry strategy;

The construction and pre-training of the countermeasure generator based on the countermeasure training strategy comprises the following specific steps:

randomly extracting a main language code segment and a secondary language code segment, generating visual characteristics by adopting a BERT network, and inputting the visual characteristics into the countermeasure generator;

the confusion generation network extracts potential features in the visual features through multiple convolution upper positions, and generates unified features through nonlinear mapping;

the feature restoration network expands the dimension of the unified feature through deconvolution, and acquires a reconstructed visual feature through nonlinear transformation;

The domain discrimination network maps the unified features to the separable space through nonlinear transformation and judges the sources of the unified features;

introducing balance parameters to comprehensively consider characteristic restoration loss and field discrimination loss, defining a loss function, and updating the loss function to be minimum based on an countermeasure training strategy;

the micro-adjustment large model encoder adopting contrast learning comprises the following specific steps:

selecting GraphCodeBERT a pre-training model as the large model encoder;

randomly extracting two main language code segments from each code set in a main code library and extracting corresponding function labels;

inputting the main language code segment into a countermeasure generator and a large model encoder in sequence to generate code characterization;

And defining a loss function, optimizing the code representation through contrast learning to reduce the loss function, and fixing the super-parameters of the large model encoder when the distinguishing property of the code representation is maximum and the loss function is minimum.

2. The method for code location based on programming language migration of claim 1, wherein the countermeasure training strategy comprises forward updating and bi-directional updating;

The forward updating means that the hyper-parameters in the confusion generation network and the feature reduction network are updated along the gradient direction of the feature reduction loss reduction;

the bidirectional updating refers to updating of the super parameters of the domain discrimination network along the gradient direction of the domain discrimination loss reduction, and the introduction of the balance parameters enables the super parameters in the confusion generation network and the characteristic restoring network to update along the gradient direction of the domain discrimination loss increase.

3. The code positioning method based on programming language migration of claim 1, wherein the two-channel query strategy selection positive example comprises the following specific steps:

if the function is the appointed function, acquiring a function label corresponding to the appointed function, inquiring a main code library, and randomly selecting a main language code segment of the same function label as a positive example;

If the typical process is designated, acquiring keywords corresponding to the typical process and generating a reference function label;

And acquiring all the function labels which are greater than or equal to the reference function label, restoring the function labels into a function code, judging whether the function code contains a specified typical process, and randomly selecting a main language code segment of the same function label from a main code library based on the function label containing the specified typical process as a positive example.

4. The code location method based on programming language migration of claim 1, wherein the constructing of the function library and the total keyword table comprises the following specific steps:

acquiring all typical processes corresponding to a single function;

extracting keywords corresponding to the typical process, and constructing a keyword list corresponding to the single function;

Establishing a function library containing all the single functions and the keyword list;

And obtaining all non-repeated keywords in the function library, and rearranging the non-repeated keywords to construct a total keyword list.

5. The code location method based on programming language migration of claim 1, wherein the constructing the master code library comprises the following specific steps:

determining a main language, traversing single functions in the function library in sequence, and delivering the single functions and the main language as subject words to a subject word self-checking crawler system to obtain a main language code segment;

obtaining a keyword list corresponding to a single function in the function library, and comparing the total keyword list to generate a function code corresponding to the single function;

Converting the function codes corresponding to the single function into function labels through reversible transformation, and endowing all main language code segments corresponding to the single function;

And storing all the main language code segments and function labels corresponding to all the single functions to construct a main code library.

6. The code positioning method based on programming language migration of claim 1, wherein the constructing the secondary code library comprises the following specific steps:

judging whether the number of the auxiliary language code segments reaches a total auxiliary threshold value, if so, directly constructing an auxiliary code library;

if not, generating a secondary language code segment through sample enhancement supplement until reaching a total secondary threshold value, and constructing a secondary code library.

7. The code positioning method based on programming language migration of any one of claims 1, 5 and 6, wherein the subject term self-checking crawler system comprises a scheduling module, a downloading module, an extracting module and a data pipeline module;

The extraction module introduces a webpage grabbing algorithm, extracts page elements, calculates a score based on text density, generates a text by standardizing the page elements with the highest score, and further extracts a title, content and URL in the text;

8. The code positioning method based on programming language migration of claim 7, wherein the hash matching deduplication method eliminates repeated relevant texts, comprising the following specific steps:

Analyzing the URL, splicing the protocol and the domain name to generate a core segment;

setting a bit array, and compressing the core segment into a fingerprint sequence through an SHA1 algorithm;

The URL is determined to be discarded or stored based on whether the index key exists in Redis.