Disclosure of Invention
The invention aims to provide an AI-based internet content detection method and system, which solve the problem that an existing single convolutional neural network or recurrent neural network is insufficient to cope with complex network environments.
In a first aspect, the present application provides an AI-based internet content detection method, the method comprising:
Collecting different types of collected data packets in an Internet network, and identifying the type of the data packet as an image type or a text type according to metadata carried by the collected data packet;
When the image type is identified, discretizing the data flow of the acquired data packet, sampling according to time domain continuity to obtain a discrete data flow after dimension reduction, and converting to obtain a gray image; according to the characteristic value distribution of the first intermediate result, determining the width of a sliding window, sampling the data flow of the acquired data packet again by using the sliding window, and directly extracting the second image characteristic from the data flow;
When the text type is identified, extracting sentences from the stream sequence of the collected data packet, inputting the sentences into a syntactic model, and performing preliminary sentence breaking to obtain a first word component; according to the preset mapping relation between phrase types and weight values, analyzing the first word components after all preliminary sentence breaking, clustering the first word components with weight values larger than a threshold value to form new sentences, and extracting sequence features from the new sentences;
The second image features and the sequence features are fused and sent into a convolution layer of an identification model, local feature components are selected by utilizing sliding windows with different sizes, the local feature components are spliced to obtain a first feature matrix, and the first feature matrix is sent into a pooling layer of the identification model to obtain a second feature matrix;
The second feature matrix is sent into a random forest of the identification model for classification, the random forest samples the second feature matrix for n rounds to obtain n training sets, each of the n training sets is used to train a decision tree on a specified number of feature values selected at random by column sampling, so that n decision trees are obtained, and the n decision trees obtain the classification result by voting;
and managing the collected data packet according to the classification result.
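By way of non-limiting illustration, the type identification step may be sketched as follows. The description does not state which metadata field is used; the sketch assumes a MIME-style `content-type` field, and the function name `identify_packet_type` is hypothetical.

```python
from typing import Mapping


def identify_packet_type(metadata: Mapping[str, str]) -> str:
    """Return 'image', 'text', or 'unknown' from a packet's metadata.

    Assumption: the metadata carries a MIME-style 'content-type' value
    such as 'image/png' or 'text/html; charset=utf-8'.
    """
    # Drop parameters such as '; charset=utf-8' and normalise case.
    content_type = metadata.get("content-type", "").split(";")[0].strip().lower()
    if content_type.startswith("image/"):
        return "image"
    if content_type.startswith("text/"):
        return "text"
    return "unknown"
```

Packets reported as `unknown` fall outside the two branches described above; how they are handled is left open here.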
In a second aspect, the present application provides an AI-based internet content detection system, the system comprising:
The acquisition and identification unit is used for acquiring different types of acquisition data packets in the Internet network, and identifying that the type of the data packet is an image type or a text type according to metadata carried by the acquisition data packet;
The image feature extraction unit is used for carrying out discretization processing on the data stream of the acquired data packet when the image type is identified, sampling according to time domain continuity to obtain a discrete data stream after dimension reduction, and converting the discrete data stream into a gray image; according to the characteristic value distribution of the first intermediate result, determining the width of a sliding window, sampling the data flow of the acquired data packet again by using the sliding window, and directly extracting the second image characteristic from the data flow;
The sequence feature extraction unit is used for extracting sentences from the stream sequence of the collected data packet when the text type is identified, inputting the sentences into a syntactic model, and performing preliminary sentence breaking to obtain a first word component; according to the preset mapping relation between phrase types and weight values, analyzing the first word components after all preliminary sentence breaking, clustering the first word components with weight values larger than a threshold value to form new sentences, and extracting sequence features from the new sentences;
The fusion unit is used for fusing the second image features and the sequence features, sending the fused features into a convolution layer of the recognition model, selecting local feature components by utilizing sliding windows with different sizes, splicing the local feature components to obtain a first feature matrix, and sending the first feature matrix into a pooling layer of the recognition model to obtain a second feature matrix;
The classification unit is used for sending the second feature matrix into a random forest of the identification model for classification, the random forest samples the second feature matrix for n rounds to obtain n training sets, each of the n training sets is used to train a decision tree on a specified number of feature values selected at random by column sampling, so that n decision trees are obtained, and the n decision trees obtain the classification result by voting;
and the management unit is used for managing the acquired data packet according to the classification result.
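By way of non-limiting illustration, the work of the sequence feature extraction unit may be sketched as follows. The sketch makes several assumptions that are not in the description: a regular-expression split stands in for the syntactic model, the weight table `PHRASE_WEIGHTS` is invented, the phrase-type tagger `phrase_type_of` is supplied by the caller, and a simple filter-and-join stands in for clustering.

```python
import re
from typing import Callable, List

# Hypothetical phrase-type weights; the description only says that such a
# mapping between phrase types and weight values is preset.
PHRASE_WEIGHTS = {"noun": 0.9, "verb": 0.7, "function": 0.1}


def break_sentence(sentence: str) -> List[str]:
    """Preliminary sentence breaking: split on whitespace and punctuation."""
    return [w for w in re.split(r"[\s,;:.!?]+", sentence) if w]


def regroup(components: List[str],
            phrase_type_of: Callable[[str], str],
            threshold: float) -> str:
    """Keep components whose weight exceeds the threshold and join them
    into a new sentence (a stand-in for the clustering step)."""
    kept = [c for c in components
            if PHRASE_WEIGHTS.get(phrase_type_of(c), 0.0) > threshold]
    return " ".join(kept)
```

With a toy tagger that labels "server" and "payload" as nouns and "sends" as a verb, the sentence "The server sends the payload." regroups to "server sends payload" at a threshold of 0.5.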
In a third aspect, the present application provides an AI-based internet content detection system, the system comprising a processor and a memory:
The memory is used for storing program codes and transmitting the program codes to the processor;
The processor is configured to perform, according to instructions in the program code, the method of any one of the four possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium for storing program code for performing the method of any one of the four possible implementations of the first aspect.
Advantageous effects
The invention provides an AI-based internet content detection method and system, which identify the types of different data packets and apply a different feature extraction mode to the image type and to the text type. For the image type, dimension-reduction sampling and sliding-window sampling are adopted to obtain image features containing high-dimensional local features. For the text type, sentence breaking and clustering are adopted to obtain sequence features after sentence recombination. The image features and the sequence features are fused according to a certain rule to obtain a feature matrix, and a classification result is finally obtained by means of a random forest. This processing solves the problem that an existing single convolutional neural network or recurrent neural network is insufficient to cope with complex network environments.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.
Fig. 1 is a general flowchart of an AI-based internet content detection method according to the present application, where the method includes:
Collecting different types of collected data packets in an Internet network, and identifying the type of the data packet as an image type or a text type according to metadata carried by the collected data packet;
When the image type is identified, discretizing the data flow of the acquired data packet, sampling according to time domain continuity to obtain a discrete data flow after dimension reduction, and converting to obtain a gray image; according to the characteristic value distribution of the first intermediate result, determining the width of a sliding window, sampling the data flow of the acquired data packet again by using the sliding window, and directly extracting the second image characteristic from the data flow;
When the text type is identified, extracting sentences from the stream sequence of the collected data packet, inputting the sentences into a syntactic model, and performing preliminary sentence breaking to obtain a first word component; according to the preset mapping relation between phrase types and weight values, analyzing the first word components after all preliminary sentence breaking, clustering the first word components with weight values larger than a threshold value to form new sentences, and extracting sequence features from the new sentences;
The second image features and the sequence features are fused and sent into a convolution layer of an identification model, local feature components are selected by utilizing sliding windows with different sizes, the local feature components are spliced to obtain a first feature matrix, and the first feature matrix is sent into a pooling layer of the identification model to obtain a second feature matrix;
The second feature matrix is sent into a random forest of the identification model for classification, the random forest samples the second feature matrix for n rounds to obtain n training sets, each of the n training sets is used to train a decision tree on a specified number of feature values selected at random by column sampling, so that n decision trees are obtained, and the n decision trees obtain the classification result by voting;
and managing the collected data packet according to the classification result.
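By way of non-limiting illustration, the image branch of the above method may be sketched as follows. The sketch assumes that the data stream is a byte string, that dimension reduction keeps every `step`-th byte in time order, and that the window width has already been determined; how the width is derived from the feature value distribution is not reproduced here. The function names are hypothetical.

```python
from typing import List


def to_gray_image(stream: bytes, step: int, width: int) -> List[List[int]]:
    """Down-sample the stream in time order and lay it out as a gray image.

    Each kept byte (0-255) becomes one pixel; the last row is zero-padded.
    """
    sampled = list(stream[::step])          # keep every step-th byte
    pad = (-len(sampled)) % width           # pixels missing in the last row
    sampled.extend([0] * pad)
    return [sampled[i:i + width] for i in range(0, len(sampled), width)]


def sliding_windows(stream: bytes, width: int, stride: int = 1) -> List[bytes]:
    """Re-sample the original stream with a sliding window of fixed width."""
    return [stream[i:i + width]
            for i in range(0, len(stream) - width + 1, stride)]
```

A stream shorter than the window width yields no windows, so a caller may wish to treat such packets separately.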
In some preferred embodiments, during training the recognition model minimizes the entropy loss function by backpropagation and avoids oversaturation; training of the recognition model is considered complete when the accuracy of the recognition model meets a threshold requirement, after which the recognition model is available for data verification.
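By way of non-limiting illustration, training until an accuracy threshold is met may be sketched as follows. The sketch assumes that the entropy loss is a cross-entropy loss, and it substitutes a single-layer logistic model for the recognition model so that the gradient step fits in a few lines; the function name and default parameters are hypothetical.

```python
import math
from typing import List, Tuple


def train_until_accurate(samples: List[List[float]], labels: List[int],
                         lr: float = 0.5, target_acc: float = 0.95,
                         max_epochs: int = 1000) -> Tuple[List[float], float, int]:
    """Minimise cross-entropy by gradient descent; stop once the training
    accuracy reaches target_acc. Returns (weights, bias, epochs used)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y                    # d(cross-entropy)/dz for a sigmoid
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
        correct = sum(
            (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y)
            for x, y in zip(samples, labels))
        if correct / len(samples) >= target_acc:
            return w, b, epoch
    return w, b, max_epochs
```

In practice the accuracy would more likely be measured on held-out data than on the training set; the sketch uses the training set only to stay short.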
In some preferred embodiments, fusing the second image features and the sequence features includes: extracting the second image features one by one in units of rows or columns and writing them into a single-dimension matrix; extracting the sequence features one by one and writing them into another single-dimension matrix; and weighting or accumulating the two single-dimension matrices to obtain a fused feature matrix.
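By way of non-limiting illustration, this fusion may be sketched as follows. The description does not say how matrices of unequal length are aligned; the sketch assumes zero-padding of the shorter one, and the weights `w_img` and `w_seq` are hypothetical parameters.

```python
from typing import List, Sequence


def flatten_rows(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Read a 2-D feature matrix row by row into a single dimension."""
    return [v for row in matrix for v in row]


def fuse(image_features: Sequence[Sequence[float]],
         sequence_features: Sequence[float],
         w_img: float = 0.5, w_seq: float = 0.5) -> List[float]:
    """Weighted element-wise sum of the two single-dimension matrices.

    Setting both weights to 1.0 gives plain accumulation.
    """
    img = flatten_rows(image_features)
    seq = list(sequence_features)
    n = max(len(img), len(seq))
    img += [0.0] * (n - len(img))           # assumption: zero-pad the shorter
    seq += [0.0] * (n - len(seq))
    return [w_img * a + w_seq * b for a, b in zip(img, seq)]
```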
Each decision tree has a targeted classification capability: the specified number of feature values is obtained according to different classifications, and the decision trees classify the same feature vector matrix from different angles, thereby integrating different classification capabilities. The resulting classification performance is higher than that of a single classifier.
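By way of non-limiting illustration, n rounds of sampling, column sampling and voting may be sketched as follows. The sketch assumes sampling with replacement (bootstrap) and uses one-split decision stumps in place of full decision trees to stay short; all function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]


def train_forest(rows, labels, n_trees, n_features, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and columns."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # row sampling
        feats = rng.sample(range(len(rows[0])), n_features)   # column sampling
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the individual tree outputs."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]
```

Because each stump sees different rows and columns, individual stumps may disagree; the vote is what gives the ensemble its advantage over any single one.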
The average generalization error of a decision tree in a random forest is related to the regression function.
In some preferred embodiments, the voting approach involves weighted accumulation of the output results of each decision tree.
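By way of non-limiting illustration, weighted accumulation of the tree outputs may be sketched as follows. The description does not say where the weights come from; the sketch simply takes one hypothetical weight per tree as input.

```python
from collections import defaultdict
from typing import List, Sequence


def weighted_vote(predictions: Sequence[str], weights: Sequence[float]) -> str:
    """Sum the weights per predicted label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

With weights 0.9, 0.3 and 0.3, a single "block" vote outweighs two "allow" votes, whereas an unweighted majority would have returned "allow".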
Fig. 2 is a schematic diagram of an AI-based internet content detection system according to the present application, where the system includes:
The acquisition and identification unit is used for acquiring different types of acquisition data packets in the Internet network, and identifying that the type of the data packet is an image type or a text type according to metadata carried by the acquisition data packet;
The image feature extraction unit is used for carrying out discretization processing on the data stream of the acquired data packet when the image type is identified, sampling according to time domain continuity to obtain a discrete data stream after dimension reduction, and converting the discrete data stream into a gray image; according to the characteristic value distribution of the first intermediate result, determining the width of a sliding window, sampling the data flow of the acquired data packet again by using the sliding window, and directly extracting the second image characteristic from the data flow;
The sequence feature extraction unit is used for extracting sentences from the stream sequence of the collected data packet when the text type is identified, inputting the sentences into a syntactic model, and performing preliminary sentence breaking to obtain a first word component; according to the preset mapping relation between phrase types and weight values, analyzing the first word components after all preliminary sentence breaking, clustering the first word components with weight values larger than a threshold value to form new sentences, and extracting sequence features from the new sentences;
The fusion unit is used for fusing the second image features and the sequence features, sending the fused features into a convolution layer of the recognition model, selecting local feature components by utilizing sliding windows with different sizes, splicing the local feature components to obtain a first feature matrix, and sending the first feature matrix into a pooling layer of the recognition model to obtain a second feature matrix;
The classification unit is used for sending the second feature matrix into a random forest of the identification model for classification, the random forest samples the second feature matrix for n rounds to obtain n training sets, each of the n training sets is used to train a decision tree on a specified number of feature values selected at random by column sampling, so that n decision trees are obtained, and the n decision trees obtain the classification result by voting;
and the management unit is used for managing the acquired data packet according to the classification result.
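By way of non-limiting illustration, the selection of local feature components with sliding windows of different sizes, followed by pooling, may be sketched as follows. The sketch assumes a one-dimensional fused feature vector, uses a plain window mean in place of learned convolution kernels, and uses max pooling; none of these choices is stated in the description.

```python
from typing import List, Sequence


def window_features(vector: Sequence[float],
                    sizes: Sequence[int]) -> List[List[float]]:
    """For each window size, slide over the vector and record the window mean.

    The rows (one per window size) are spliced into a ragged feature matrix.
    """
    return [[sum(vector[i:i + k]) / k for i in range(len(vector) - k + 1)]
            for k in sizes]


def max_pool(rows: Sequence[Sequence[float]]) -> List[float]:
    """Keep the strongest response of each non-empty row."""
    return [max(row) for row in rows if row]
```

Pooling reduces each row to one value, so rows of different lengths produced by different window sizes become comparable before classification.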
The application provides an AI-based internet content detection system, which comprises a processor and a memory:
The memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method according to any of the embodiments of the first aspect according to instructions in the program code.
The present application provides a computer readable storage medium for storing program code for performing the method of any one of the embodiments of the first aspect.
In a specific implementation, the present invention also provides a computer storage medium, where the computer storage medium may store a program which, when executed, may perform some or all of the steps in the various embodiments of the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments, or in some parts of the embodiments, of the present invention.
For the same or similar parts, the various embodiments of the present description may be referred to one another. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference should be made to the description of the method embodiments for the relevant matters.
The embodiments of the present invention described above do not limit the scope of the present invention.