Disclosure of Invention
To address these problems, the invention provides a network protocol entity extraction method and system based on small sample learning. By fully learning the semantic features of samples, the method extracts network protocol entities accurately; its training effect on small samples is consistent with that on large samples, it is highly robust, and it also achieves high extraction precision on network protocol entities that do not appear in the training samples.
The technical scheme adopted by the invention is as follows:
a network protocol entity extraction method based on small sample learning comprises the following steps:
1) constructing a network protocol document set according to expert knowledge;
2) extracting fields and description information contained in a network protocol entity from the network protocol document set, and forming a network protocol information data set by the fields and the description information;
3) carrying out block processing on the network protocol information data set to form a network protocol text block set;
4) training a traditional machine learning model on the network protocol text block set to obtain a trained potential network protocol entity classifier;
5) training a network protocol entity accurate identification model based on a neural network by utilizing the network protocol text block set;
6) fusing the potential network protocol entity classifier and the network protocol entity precise identification model to obtain a network protocol entity extraction model based on small sample learning;
7) and performing network protocol entity extraction on the network protocol text to be processed, using the network protocol entity extraction model based on small sample learning.
Further, step 1) preprocesses the documents in the network protocol document set (i.e., the RFC document set) using heuristic rules or toolkits, with the following steps:
removing headers and footers in the text by a pattern matching method;
removing charts: most charts consist of the symbols "+-" or other special characters, so the line containing such a symbol is first located in the text, and then every line from that line downward is deleted until a line is reached that contains no special symbol and whose word sparsity is above a threshold.
Further, the block processing of the network protocol information data set in step 3) includes: converting each sentence in the text into a syntax tree structure using an NLP tool in the CoreNLP package; according to the syntax tree, each sentence can be segmented into several syntactic phrases.
Further, step 3) divides the description information in the network protocol text block set obtained after block processing into positive and negative samples; the samples are vectorized and then used as the input of the traditional machine learning model in step 4) to generate a classifier for predicting potential network protocol entities, namely the potential network protocol entity classifier.
Further, the potential network protocol entities in step 4) are filtered using twelve parts of speech that most negative samples contain but positive samples do not; a toolkit is used to extract the part of speech of each word in a candidate entity, and entities containing these parts of speech are removed. The twelve parts of speech are: adverb, infinitive marker "to", verb in third-person singular present, interjection, cardinal number, modal verb, preposition, gerund, coordinating conjunction, verb in non-third-person singular present, verb in base form, and possessive ending.
Further, step 5) performs word embedding on the network protocol text blocks in the network protocol text block set, partitions the blocks according to the result set, and inputs them into a network protocol field model, namely the network protocol entity accurate identification model; a neural network is used to train this model to be sensitive to protocol header fields. The network protocol entity accurate identification model comprises a linear aggregation layer and a nonlinear layer: the nonlinear layer examines the descriptive semantic information of the field information separately, so that valuable field information is retained; the linear aggregation layer connects all hidden states (i.e., the intermediate results of the nonlinear layer) to make full use of the network's inference results.
Further, step 7) comprises:
1) preprocessing the network protocol text to be subjected to entity extraction according to the foregoing method;
2) inputting the preprocessed network protocol text block set into the constructed potential network protocol entity classifier to obtain a potential network protocol entity set;
3) inputting the obtained potential network protocol entity set into the constructed network protocol entity accurate identification model;
4) and inputting the output of the network protocol entity accurate identification model into a classification layer for classification to obtain the entity extraction result.
A network protocol entity extraction system based on small sample learning, comprising:
1) the model module comprises a network protocol entity extraction model constructed by the method, and the model receives a network protocol text of an entity to be extracted as input;
2) the fusion module is used for fusing the potential network protocol entity classifier and the network protocol entity accurate identification model to obtain the network protocol entity extraction model;
3) and the classification module is used for inputting the result of the network protocol entity extraction model into a classification layer for classification to obtain the entity extraction result.
A storage medium having stored therein a computer program for executing the above-mentioned method of the present invention.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the above-described method of the invention.
Compared with the prior art, the invention has the following advantages: the invention provides a network protocol entity extraction method based on small sample learning which, trained on only a small number of labeled RFC document samples, can extract protocol entities from unlabeled RFC documents while maintaining high identification precision. The model will help realize automatic analysis of network protocols in the future and support research on computer networks.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
The main content of the invention comprises:
1. network protocol document set
RFC is a series of numbered documents that collect network-protocol-related information about the Internet, as well as software documents for UNIX and the Internet community; the basic Internet communication protocols are specified in RFC documents. Both the charts and the text in an RFC document contain network protocol entities (such as Offset, Sequence Number, Reserved, etc.). If a character-matching method is used to extract protocol entities from the charts, the entities cannot be extracted correctly, because the same entity takes different forms in different documents. The text of an RFC document, however, contains detailed descriptions of the protocol entities, so the entities can be found through semantic features and extracted from the text.
2. Latent entity mining model (latent network protocol entity classifier)
The first-stage model architecture is shown in FIG. 1. The model input is each word in a block, which is converted into a vector (F1~F5 in FIG. 1) through TF-IDF and then fed into an SVM classifier; the output is the label of the block. The block input in the figure, "Sequence Number of the sender", is an entity reference to "Sequence Number" in the result set and is therefore marked with a positive label. After all blocks are classified by the classifier, the set of positively labeled blocks is used as the input to the next stage.
3. Accurate entity recognition model (network protocol entity accurate recognition model)
To learn more entity features, the invention expands the sample set. After expansion, the accuracy of the entities extracted by the first stage of the method was found to drop significantly. To analyze the cause, the result set is compared with the extracted entities, and the extracted entity set is divided into positive and negative samples. The goal of the accurate entity recognition model is to reduce the number of negative samples as much as possible while keeping the number of positive samples unreduced.
By analyzing the composition of the positive and negative samples, it is found that most negative-sample blocks contain verb phrases or adverbs while the positive samples do not. The parts of speech that appear only in the negative samples are summarized in Table 1, and part of the negative samples are deleted by this part-of-speech noise reduction. In the experiment, the part of speech of each word in a block is extracted using an NLP tool in the CoreNLP package, and blocks in the sample set that contain any part of speech in Table 1 are deleted.
Table 1. List of parts of speech contained only in negative samples. The first column gives the part-of-speech tag used by the NLP tool.
| Part-of-speech tag | Meaning |
|---|---|
| RB | Adverb |
| TO | Infinitive marker "to" |
| VBZ | Verb, third-person singular present |
| UH | Interjection |
| CD | Cardinal number |
| MD | Modal verb |
| IN | Preposition |
| VBG | Gerund / present participle |
| CC | Coordinating conjunction |
| VBP | Verb, non-third-person singular present |
| VB | Verb, base form |
| POS | Possessive ending |
After this part-of-speech noise reduction is applied to the entity set, half of the blocks in the original negative sample set are deleted while the number of positive samples is unchanged; however, the remaining negative samples contain noun phrases with the same parts of speech as the positive samples, so they cannot be eliminated by part of speech alone.
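The part-of-speech noise-reduction step can be sketched in a few lines. The sketch below is a minimal pure-Python illustration: `toy_tagger` and its tiny lexicon are hypothetical stand-ins for the CoreNLP tagger, while `NOISY_TAGS` lists the twelve Penn Treebank tags of Table 1.

```python
# Part-of-speech noise reduction: drop any text block containing one of
# the twelve noisy Penn Treebank tags listed in Table 1.
NOISY_TAGS = {"RB", "TO", "VBZ", "UH", "CD", "MD",
              "IN", "VBG", "CC", "VBP", "VB", "POS"}

# Hypothetical stand-in for the CoreNLP tagger: a tiny hand-written lexicon.
TOY_LEXICON = {
    "sequence": "NN", "number": "NN", "of": "IN", "the": "DT",
    "sender": "NN", "is": "VBZ", "reserved": "NN", "field": "NN",
}

def toy_tagger(block):
    """Return (word, tag) pairs; unknown words default to NN (noun)."""
    return [(w, TOY_LEXICON.get(w.lower(), "NN")) for w in block.split()]

def denoise(blocks, tagger=toy_tagger):
    """Keep only blocks whose words carry none of the noisy tags."""
    return [b for b in blocks
            if not any(tag in NOISY_TAGS for _, tag in tagger(b))]

blocks = ["Sequence Number", "is the sequence number", "Reserved field"]
print(denoise(blocks))  # the middle block contains VBZ/IN and is dropped
```

A real pipeline would substitute CoreNLP's tagger for `toy_tagger`; the filtering logic is unchanged.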
The invention uses AttBi-LSTM as the classifier, with the positive and negative samples after part-of-speech noise reduction as input; the word2vec model is used for word-vector conversion for the deep learning model, and TF-IDF is used for vector conversion for the SVM. The trained classifier learns a score function S_2(Pos, Neg) that computes the probability of a block being a positive sample, where Pos denotes positive samples and Neg denotes negative samples. Applying the S_2 function to a new block set accurately predicts the positive samples in it, which are selected as the final entity set, thereby further reducing the number of negative samples. Taking AttBi-LSTM (Attention-based Bi-LSTM) as an example, as shown in FIG. 2, the input is a block remaining after part-of-speech noise reduction of the first-stage extracted entities; the block consists of 5 words, and the positive or negative label of the block is output after AttBi-LSTM classification. In the figure, the vector u_t represents the importance of each word, and a_t is the normalized word weight. The sentence vector v, which summarizes all the information in the sentence, is the weighted sum of the hidden states, v = Σ_t a_t h_t, where the a_t are the corresponding weights.
In FIG. 2, the Embedding Layer (Embedding Layer) is used for word Embedding, the Bi-LSTM Layer is used for vector encoding, the Attention Layer (Attention Layer) is used for feature focusing, and the fully-connected Layer + Softmax is used for tag classification.
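The attention pooling performed by the Attention Layer of FIG. 2 — normalizing word-importance scores into weights a_t and summing the hidden states into a sentence vector v — can be illustrated with a minimal numerical sketch (plain Python, not the trained model; the toy hidden states and scores are invented for illustration):

```python
import math

def softmax(scores):
    """Normalize importance scores u_t into attention weights a_t."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores):
    """Weighted sum v = sum_t a_t * h_t over the per-word hidden states."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    v = [sum(a * h[i] for a, h in zip(weights, hidden_states))
         for i in range(dim)]
    return v, weights

# Three toy 2-d hidden states; the second word gets the highest score.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v, a = attention_pool(h, [0.1, 2.0, 0.1])
print([round(x, 3) for x in v], [round(x, 3) for x in a])
```

The weights sum to 1, and the word with the highest score dominates the sentence vector, which is the focusing behavior the Attention Layer provides.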
Example (b):
the method comprises the following steps: acquisition of web protocol text
First, RFC documents are selected that comply with two specific rules: 1) the header field name is on the same line as the size it occupies, the two separated by special characters such as a colon, space, or parentheses, for example "(Type of Service: 8 bits)"; 2) the detailed description of the header field is directly below the field. The upper box of FIG. 3 is an example conforming to these features; the lower box is free-standing text whose entities cannot be extracted by a heuristic method and must be extracted by the small-sample-based network protocol entity extraction method of the invention (abbreviated FSL in FIG. 3, for Few-Shot Learning), where "ACK control bit" and "sequence number" are the header fields. Then, a heuristic-based method is applied to extract entities from the RFC documents meeting the above conditions, and these entities are collected into a result set.
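Rule 1 above (field name and size on one line, separated by a colon, space, or parentheses) lends itself to a simple pattern match. The regex below is an illustrative approximation of such a heuristic, not the exact pattern used by the method:

```python
import re

# One plausible pattern for rule 1: a capitalized field name followed by
# its bit width on the same line, optionally wrapped in parentheses and
# separated by a colon or space.
FIELD_RE = re.compile(r"\(?\s*([A-Z][A-Za-z ]+?)\s*[:(]?\s*(\d+)\s*bits?\s*\)?")

def match_header_field(line):
    """Return (field_name, bit_width) if the line defines a header field."""
    m = FIELD_RE.search(line)
    return (m.group(1).strip(), int(m.group(2))) if m else None

print(match_header_field("Type of Service: 8 bits"))
print(match_header_field("(Sequence Number: 32 bits)"))
print(match_header_field("The sequence number is incremented"))  # no match
```

Lines that match are treated as header-field definitions; the description text for the field is then read from the lines below, per rule 2.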
Step two: network protocol text preprocessing
1) Since the protocol entities are extracted from the RFC text, information irrelevant to the text describing the protocol fields should be deleted to reduce text noise. The headers and footers in the text are deleted first; their form is essentially fixed, so they can be removed by pattern matching. Next, the diagrams in the document are deleted. Most diagrams consist of the symbols "+-" or other special characters, so we first locate the line in the text where such a symbol appears and then delete lines from that line downward. Since every line belonging to a diagram has low word sparsity, a threshold is specified in advance, and the line-by-line deletion stops when a line contains no special symbol and its word sparsity is above the threshold.
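The diagram-removal heuristic just described can be sketched as follows (a simplified illustration; the sparsity measure, the special-character set, and the threshold of 0.8 are plausible choices, not the exact ones used):

```python
SPECIAL = set("+-|")  # characters that typically make up RFC ASCII diagrams

def word_sparsity(line):
    """Fraction of word-like characters in a line — low for diagram lines."""
    if not line.strip():
        return 0.0
    wordy = sum(c.isalnum() or c.isspace() for c in line)
    return wordy / len(line)

def remove_charts(lines, threshold=0.8):
    """Delete runs of lines starting at a '+-' diagram border, continuing
    downward until a line has no special symbol and sparsity above threshold."""
    out, skipping = [], False
    for line in lines:
        has_special = any(c in SPECIAL for c in line)
        if "+-" in line:
            skipping = True                      # diagram border found
        if skipping and not has_special and word_sparsity(line) > threshold:
            skipping = False                     # normal prose resumes here
        if not skipping:
            out.append(line)
    return out

cleaned = remove_charts(["+-+-+", "| A |", "Plain descriptive text here."])
print(cleaned)  # only the prose line survives
```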
2) The adjusted network protocol description text is then segmented into blocks. The description part follows the header field definition and is extracted as follows: after locating a header field, read line by line below it and stop when the definition of the next header field is read; repeat this operation to read the description of each field, and stop reading the description of the last field when the heading of the following section is matched. For each sample RFC, the extracted description text is saved separately for subsequent chunking. In the experiment, text segmentation is implemented with an NLP tool in the CoreNLP package: the tool converts each sentence of the text into a syntax tree structure, and each sentence can be segmented into several syntactic phrases according to the syntax tree, as shown in FIG. 4.
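The syntax-tree chunking can be illustrated on the bracketed parses that CoreNLP produces. The sketch below reads a bracketed parse and splits the sentence into its top-level phrases (a simplified stand-in: CoreNLP normally wraps the sentence in a ROOT node, and deeper splitting strategies are possible):

```python
import re

def tokenize(parse):
    """Break a bracketed parse string into '(', ')' and word/label tokens."""
    return re.findall(r"\(|\)|[^()\s]+", parse)

def read_tree(tokens, i=0):
    """Recursively read one (LABEL child...) node; returns (node, next_i)."""
    assert tokens[i] == "("
    label = tokens[i + 1]
    i += 2
    children = []
    while tokens[i] != ")":
        if tokens[i] == "(":
            child, i = read_tree(tokens, i)
            children.append(child)
        else:
            children.append(tokens[i])  # a leaf word
            i += 1
    return (label, children), i + 1

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for c in children for w in leaves(c)]

def top_phrases(parse):
    """Split a sentence into the phrases under the root node."""
    tree, _ = read_tree(tokenize(parse))
    _, children = tree
    return [" ".join(leaves(c)) for c in children]

parse = "(S (NP (DT The) (NN sequence) (NN number)) (VP (VBZ is) (VBN reserved)))"
print(top_phrases(parse))  # ['The sequence number', 'is reserved']
```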
Step three: latent entity mining model based on traditional machine learning
First, whether each training-set text block appears in the entity result set is judged, and each block is given a positive or negative label. The text blocks of the training samples and the result-set entities are combined into an overall corpus, and each text block and result-set entity is converted into a vector of the same dimension using TF-IDF. To characterize each label, the cosine similarity between each text block and every result-set entity is computed, and all the cosine values are combined as the features of the block's label. The cosine value set of each block is computed using a matrix structure, and the block features and labels are then fed into the SVM for training, which yields a preliminary SVM classifier for text blocks. Given the cosine-value vector between a new block and the result set, the classifier predicts the label of the block; text blocks predicted as positive are taken as the first-stage extracted entities of the RFC document. The first-stage test set is preprocessed and fed into the SVM classifier to obtain the candidate entity set of the test set.
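The TF-IDF vectorization and cosine-similarity feature construction of this step can be sketched in pure Python (a minimal illustration; a real pipeline would use an optimized vectorizer, and the +1 smoothing in the `idf` term is one plausible choice):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Vectorize documents over a shared vocabulary with TF-IDF weights."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    # +1 smoothing keeps terms shared by all documents from vanishing
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        total = len(d.split())
        vecs.append([tf[w] / total * idf[w] for w in vocab])
    return vecs

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def block_features(blocks, result_entities):
    """Feature of each block = cosine similarity to every result-set entity."""
    vecs = tfidf_vectors(blocks + result_entities)
    bvecs, evecs = vecs[:len(blocks)], vecs[len(blocks):]
    return [[cosine(b, e) for e in evecs] for b in bvecs]

feats = block_features(["sequence number of the sender", "checksum field"],
                       ["sequence number", "checksum"])
print(feats)  # each row: similarities of one block to the two entities
```

The feature rows would then be fed, together with the positive/negative labels, into the SVM.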
Step four: accurate entity recognition model based on neural network
First, wrongly segmented text blocks in the candidate entity set are filtered out using the part-of-speech noise-reduction method. The result-set entities of the first-stage test set are then introduced, and the denoised block set is divided into positive and negative classes: the positive label set corresponds to the blocks containing an entity in the result set, and the negative label set is the opposite. The positive and negative label sets are converted into word vectors before being fed into the second-stage classifier; we use the word2vec model for this conversion. The description text about header fields in the sample RFC documents is merged into a total training corpus. word2vec obtains a language model by training on the corpus and converts input text into word vectors. Because our corpus is not large, the word-vector dimension is set to only 100. After the blocks in the positive and negative label sets are converted into word vectors by word2vec, they are divided into a training set and a test set at a ratio of 8:2 and fed into the AttBi-LSTM classifier for training, which yields the positive/negative label classifier.
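The 8:2 division into training and test sets can be sketched as follows (a hypothetical helper; the seed and shuffling policy are illustrative choices, not specified by the method):

```python
import random

def split_8_2(samples, seed=42):
    """Shuffle labeled samples and divide them into train/test at 8:2."""
    data = list(samples)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    cut = int(len(data) * 0.8)
    return data[:cut], data[cut:]

labeled = [("block %d" % i, i % 2) for i in range(10)]  # (text, label) pairs
train, test = split_8_2(labeled)
print(len(train), len(test))  # 8 2
```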
Step five: network protocol entity abstraction
To extract protocol entities from an unlabeled RFC document, the first-stage SVM classifier first screens the segmented text to obtain the first-stage extracted text block set. Part-of-speech noise reduction is then applied to this set, the remaining text blocks are screened by the second-stage deep learning classifier, and the positive sample block set is extracted. The positive sample set is taken as the network protocol entity extraction set of the unlabeled RFC document.
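The extraction flow above composes three components into one pipeline. The sketch below shows only the composition; the three callables are toy stand-ins for the trained SVM classifier, the part-of-speech filter, and the AttBi-LSTM classifier:

```python
def extract_entities(blocks, stage1_predict, denoise, stage2_predict):
    """Two-stage extraction: stage-1 screen, POS denoise, stage-2 screen."""
    candidates = [b for b in blocks if stage1_predict(b)]  # SVM stand-in
    candidates = denoise(candidates)                       # POS filter stand-in
    return [b for b in candidates if stage2_predict(b)]    # AttBi-LSTM stand-in

# Toy stand-ins: stage 1 keeps capitalized or field-like blocks, the
# denoiser drops blocks with verbs, stage 2 keeps short noun phrases.
blocks = ["Sequence Number", "is set to zero",
          "Acknowledgment Number", "padding bytes follow"]
result = extract_entities(
    blocks,
    stage1_predict=lambda b: any(w[0].isupper() for w in b.split()) or "padding" in b,
    denoise=lambda bs: [b for b in bs
                        if not any(w in ("is", "follow") for w in b.split())],
    stage2_predict=lambda b: len(b.split()) <= 3,
)
print(result)  # ['Sequence Number', 'Acknowledgment Number']
```

In the actual system the three callables would be the trained components from steps two through four; only the wiring is shown here.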
Analysis: the invention provides a network protocol entity extraction method based on small sample learning to solve the entity extraction problem in the network protocol field, and experiments show that its precision (Table 2) is significantly higher than that of a single entity extraction model, proving that the method is feasible. Experiments show that when only 5 manually labeled RFC documents are used to train the disclosed model, the precision of network protocol entity extraction reaches 88.4%; compared with existing methods, the invention has higher precision and better robustness in network protocol entity extraction, and also recognizes network protocol entities not appearing in the training set better.
TABLE 2 Experimental results
| Model name | Precision | Recall | F1 |
|---|---|---|---|
| SVM | 80.2% | 53.5% | 64.1% |
| AttBi-LSTM | 76.1% | 54.7% | 63.7% |
| Combined model | 88.4% | 58.5% | 70.4% |
Based on the same inventive concept, another embodiment of the present invention provides a network protocol entity extraction system based on small sample learning, which includes:
the model module comprises a network protocol entity extraction model constructed by the method, and the model receives a network protocol text of an entity to be extracted as input;
the fusion module is used for fusing the potential network protocol entity classifier and the network protocol entity accurate identification model to obtain the network protocol entity extraction model;
and the classification module is used for inputting the result of the network protocol entity extraction model into a classification layer for classification to obtain an entity extraction result.
The specific implementation process of each module is described in the method of the present invention above.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
Although specific details, algorithms, and figures of the invention are disclosed for illustrative purposes, they are intended to aid understanding of the invention and its implementation. As those skilled in the art will appreciate, various substitutions, changes, and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The invention should not be limited to the preferred embodiments and drawings disclosed herein, but should be defined only by the scope of the appended claims.