CN116167000B - AI-based Internet content detection method and system

Info

Publication number: CN116167000B
Application number: CN202211592646.4A
Authority: CN
Inventors: 孙涛; 孙中民
Original assignee: Tianjin Guorui Digital Safety System Co ltd
Current assignee: Tianjin Guorui Digital Safety System Co ltd
Priority date: 2022-12-13
Filing date: 2022-12-13
Publication date: 2025-10-03
Anticipated expiration: 2042-12-13
Also published as: CN116167000A

Abstract

The present invention provides an AI-based Internet content detection method and system that identifies the type of each data packet and applies a different feature extraction method to image-type and text-type data. For image types, dimensionality-reduction sampling and sliding-window sampling are used to obtain image features containing high-dimensional local features. For text types, sentence segmentation and clustering are used to obtain sequence features from the reorganized sentences. The image features and sequence features are fused according to certain rules to obtain a feature matrix, and a random forest is then used to obtain the classification result. This processing overcomes the problem that a single convolutional neural network or recurrent neural network is insufficient to cope with complex network environments.

Description

AI-based internet content detection method and system

Technical Field

The application relates to the technical field of network security, in particular to an internet content detection method and system based on AI.

Background

Existing content detection methods mostly adopt a convolutional neural network or a recurrent neural network. A convolutional neural network extracts image features, reduces their dimensionality with a pooling layer, and applies a nonlinear transformation through a flatten layer and a fully connected (dense) layer to produce a classification. A recurrent neural network extracts features from sequential training data: it traverses all elements of the data and generates output of a specific dimension that characterizes the sequence feature information.

However, the Internet is complex and carries both image-type data and text-type data. A single convolutional neural network or recurrent neural network is insufficient to cope with such a complex network environment.

Therefore, there is an urgent need for a targeted AI-based internet content detection method and system.

Disclosure of Invention

The invention aims to provide an AI-based internet content detection method and system that solve the problem that an existing single convolutional neural network or recurrent neural network is insufficient to cope with complex network environments.

In a first aspect, the present application provides an AI-based internet content detection method, the method comprising:

Collecting data packets of different types in the Internet, and identifying the type of each collected data packet as an image type or a text type according to the metadata carried by the collected data packet;

When the image type is identified, discretizing the data stream of the collected data packet, sampling it according to time-domain continuity to obtain a dimension-reduced discrete data stream, and converting that stream into a grayscale image; determining the width of a sliding window according to the feature value distribution of the first intermediate result, sampling the data stream of the collected data packet again using the sliding window, and extracting the second image features directly from the data stream;

When the text type is identified, extracting sentences from the stream sequence of the collected data packet, inputting the sentences into a syntactic model, and performing preliminary sentence segmentation to obtain first word components; analyzing all first word components obtained from the preliminary segmentation according to a preset mapping between phrase types and weight values, clustering the first word components whose weight values are greater than a threshold to form new sentences, and extracting sequence features from the new sentences;

The second image features and the sequence features are fused and sent into a convolution layer of an identification model, local feature components are selected by utilizing sliding windows with different sizes, the local feature components are spliced to obtain a first feature matrix, and the feature matrix is sent into a pooling layer of the identification model;

The second feature matrix is sent into a random forest of the identification model for classification: the random forest samples the second feature matrix in n rounds to obtain n training sets, each of the n training sets is used to train one decision tree on a specified number of feature values chosen at random by column sampling, yielding n decision trees, and the n decision trees produce the classification result by voting;

and managing the collected data packet according to the classification result.
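The type identification and image branch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `content_type` metadata field, the sampling step, the image width, and the spread-based rule for choosing the window width are all hypothetical, and the sampled byte values stand in for the "first intermediate result", which the text does not define.

```python
from statistics import pstdev


def identify_type(metadata: dict) -> str:
    """Label a packet 'image' or 'text' from an assumed 'content_type' field."""
    content_type = metadata.get("content_type", "")
    if content_type.startswith("image/"):
        return "image"
    if content_type.startswith("text/"):
        return "text"
    return "unknown"


def downsample(stream: bytes, step: int) -> list[int]:
    """Keep every step-th byte in order, so time-domain ordering is preserved."""
    return list(stream[::step])


def to_gray_image(values: list[int], width: int) -> list[list[int]]:
    """Reshape byte values (0..255) into rows of `width` pixels, zero-padding the last row."""
    padded = values + [0] * (-len(values) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def choose_window_width(values: list[int], narrow: int = 4, wide: int = 8,
                        spread_threshold: float = 64.0) -> int:
    """Assumed rule: a widely spread value distribution gets the narrower window."""
    return narrow if pstdev(values) > spread_threshold else wide


def sliding_windows(stream: bytes, width: int, stride: int) -> list[list[int]]:
    """Sample the original stream again with a fixed-width sliding window."""
    return [list(stream[i:i + width])
            for i in range(0, len(stream) - width + 1, stride)]


stream = bytes(range(32))
if identify_type({"content_type": "image/png"}) == "image":
    reduced = downsample(stream, step=2)
    gray = to_gray_image(reduced, width=5)
    width = choose_window_width(reduced)
    windows = sliding_windows(stream, width, stride=width)
```

Each window in `windows` would then be turned into a feature vector; how the second image features are computed from a window is left open here because the text does not specify it.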

In a second aspect, the present application provides an AI-based internet content detection system, the system comprising:

The acquisition and identification unit is used for acquiring data packets of different types in the Internet, and identifying the type of each acquired data packet as an image type or a text type according to the metadata carried by the acquired data packet;

The image feature extraction unit is used for, when the image type is identified, discretizing the data stream of the acquired data packet, sampling it according to time-domain continuity to obtain a dimension-reduced discrete data stream, and converting that stream into a grayscale image; and for determining the width of a sliding window according to the feature value distribution of the first intermediate result, sampling the data stream of the acquired data packet again using the sliding window, and extracting the second image features directly from the data stream;

The sequence feature extraction unit is used for, when the text type is identified, extracting sentences from the stream sequence of the acquired data packet, inputting the sentences into a syntactic model, and performing preliminary sentence segmentation to obtain first word components; and for analyzing all first word components obtained from the preliminary segmentation according to a preset mapping between phrase types and weight values, clustering the first word components whose weight values are greater than a threshold to form new sentences, and extracting sequence features from the new sentences;

The fusion unit is used for fusing the second image features and the sequence features, sending the second image features and the sequence features into a convolution layer of the recognition model, selecting local feature components by utilizing sliding windows with different sizes, splicing the local feature components to obtain a first feature matrix, and sending the feature matrix into a pooling layer of the recognition model;

The classification unit is used for sending the second feature matrix into a random forest of the identification model for classification: the random forest samples the second feature matrix in n rounds to obtain n training sets, each of the n training sets is used to train one decision tree on a specified number of feature values chosen at random by column sampling, yielding n decision trees, and the n decision trees produce the classification result by voting;

and the management unit is used for managing the acquired data packet according to the classification result.
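The work of the sequence feature extraction unit can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: a punctuation split stands in for the syntactic model, the toy lexicon and the `PHRASE_WEIGHTS` values stand in for the preset mapping between phrase types and weights, simple threshold filtering stands in for clustering, and word indices stand in for the sequence features.

```python
import re

# Hypothetical mapping from phrase type to weight value.
PHRASE_WEIGHTS = {"verb_phrase": 0.9, "noun_phrase": 0.7, "function_words": 0.1}

# Toy lexicon standing in for a real syntactic model.
VERBS = {"click", "download", "send", "win"}
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "please"}


def split_components(sentence: str) -> list[str]:
    """Preliminary segmentation: split on commas, semicolons and full stops."""
    return [part.strip() for part in re.split(r"[,;.]", sentence) if part.strip()]


def phrase_type(component: str) -> str:
    """Assign a coarse phrase type from the toy lexicon."""
    words = component.lower().split()
    if any(word in VERBS for word in words):
        return "verb_phrase"
    if all(word in FUNCTION_WORDS for word in words):
        return "function_words"
    return "noun_phrase"


def regroup(sentence: str, threshold: float = 0.5) -> str:
    """Keep components whose weight exceeds the threshold and join them into a new sentence."""
    kept = [c for c in split_components(sentence)
            if PHRASE_WEIGHTS[phrase_type(c)] > threshold]
    return " ".join(kept)


def sequence_features(sentence: str, vocabulary: dict) -> list[int]:
    """Encode each word as its index in a vocabulary that grows as new words appear."""
    return [vocabulary.setdefault(word, len(vocabulary))
            for word in sentence.lower().split()]


new_sentence = regroup("Please, click the link, and, win a prize.")
features = sequence_features(new_sentence, {})
```

In this example the low-weight components "Please" and "and" are dropped, so the regrouped sentence keeps only the two verb phrases.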

In a third aspect, the present application provides an AI-based internet content detection system, the system comprising a processor and a memory:

The memory is used for storing program codes and transmitting the program codes to the processor;

The processor is configured to perform, according to instructions in the program code, the method of any one of the four possible implementations of the first aspect.

In a fourth aspect, the present application provides a computer readable storage medium for storing program code for performing the method of any one of the four possible implementations of the first aspect.

Advantageous effects

The invention provides an AI-based internet content detection method and system that identify the type of each data packet and apply a different feature extraction mode to image types and text types. For the image type, dimension-reduction sampling and sliding-window sampling are used to obtain image features containing high-dimensional local features. For the text type, sentence segmentation and clustering are used to obtain sequence features after sentence recombination. The image features and the sequence features are fused according to a certain rule to obtain a feature matrix, and a random forest is finally used to obtain the classification result. This processing solves the problem that an existing single convolutional neural network or recurrent neural network is insufficient to cope with complex network environments.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of an AI-based internet content detection method of the present invention;

Fig. 2 is a schematic diagram of an AI-based internet content detection system according to the present invention.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the present invention and so that its scope of protection is clearly defined.

Fig. 1 is a general flowchart of an AI-based internet content detection method according to the present application, where the method includes:

and managing the collected data packet according to the classification result.
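The convolution and pooling step of the method (selecting local feature components with sliding windows of different sizes, splicing them into a feature matrix, then pooling) can be sketched as below. The averaging kernel, the window sizes, the zero-padding used to stack rows of different lengths, and max pooling are all assumptions chosen for illustration; the text does not fix any of them.

```python
def window_components(vector: list[float], size: int) -> list[float]:
    """Slide a window of `size` over the vector and average each window."""
    return [sum(vector[i:i + size]) / size
            for i in range(len(vector) - size + 1)]


def first_feature_matrix(vector: list[float], sizes=(2, 3)) -> list[list[float]]:
    """One row per window size; shorter rows are zero-padded so the rows can be stacked."""
    rows = [window_components(vector, size) for size in sizes]
    width = max(len(row) for row in rows)
    return [row + [0.0] * (width - len(row)) for row in rows]


def max_pool(matrix: list[list[float]], pool: int = 2) -> list[list[float]]:
    """Non-overlapping max pooling along each row."""
    return [[max(row[i:i + pool]) for i in range(0, len(row), pool)]
            for row in matrix]


fused = [1.0, 3.0, 5.0, 7.0, 9.0]      # stand-in for the fused feature vector
matrix = first_feature_matrix(fused)   # [[2.0, 4.0, 6.0, 8.0], [3.0, 5.0, 7.0, 0.0]]
pooled = max_pool(matrix)              # [[4.0, 8.0], [5.0, 7.0]]
```

The pooled output is assumed here to play the role of the matrix that is passed on to the random forest.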

In some preferred embodiments, the recognition model is trained by minimizing the entropy loss function through backpropagation while avoiding oversaturation. Training is considered complete when the accuracy of the recognition model meets a threshold requirement, after which the model can be used for data verification.
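A minimal sketch of this training rule follows, assuming that the "entropy loss function" is binary cross-entropy and using a single-layer logistic model as a stand-in for the recognition model. The learning rate, the accuracy threshold of 0.95, and the toy data are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, accuracy_threshold=0.95,
          learning_rate=0.5, max_epochs=1000):
    """Gradient descent on cross-entropy; stops once accuracy meets the threshold."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    accuracy = 0.0
    for _ in range(max_epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        correct = 0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            correct += int((p > 0.5) == (y == 1))
            error = p - y  # derivative of the cross-entropy loss w.r.t. the pre-activation
            for j, xi in enumerate(x):
                grad_w[j] += error * xi  # chain rule back to each weight
            grad_b += error
        accuracy = correct / len(samples)
        if accuracy >= accuracy_threshold:
            break  # training is considered complete
        n = len(samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, accuracy


# Toy data: the label equals the first feature.
weights, bias, accuracy = train([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])
```

For a deeper model the same stopping rule applies; only the gradient computation changes, since the error term is then propagated back through every layer.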

In some preferred embodiments, fusing the second image features and the sequence features includes: extracting the second image features one by one, row by row or column by column, and writing them into a one-dimensional matrix; extracting the sequence features one by one and writing them into another one-dimensional matrix; and weighting or accumulating the two one-dimensional matrices to obtain a fused feature matrix.
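The fusion rule can be sketched in a few lines. The weights of 0.5 and the zero-padding of the shorter vector are assumptions; the text does not say how vectors of different lengths are aligned.

```python
def flatten_rows(matrix: list[list[float]]) -> list[float]:
    """Read the matrix one element at a time, row by row."""
    return [value for row in matrix for value in row]


def flatten_columns(matrix: list[list[float]]) -> list[float]:
    """Read the matrix one element at a time, column by column."""
    return [row[j] for j in range(len(matrix[0])) for row in matrix]


def fuse(image_features, sequence_features,
         image_weight=0.5, sequence_weight=0.5, by="rows"):
    """Weighted sum of the two one-dimensional vectors (shorter one zero-padded)."""
    flat_image = flatten_rows(image_features) if by == "rows" else flatten_columns(image_features)
    flat_sequence = list(sequence_features)
    length = max(len(flat_image), len(flat_sequence))
    flat_image += [0.0] * (length - len(flat_image))
    flat_sequence += [0.0] * (length - len(flat_sequence))
    return [image_weight * a + sequence_weight * b
            for a, b in zip(flat_image, flat_sequence)]


fused_by_rows = fuse([[1, 2], [3, 4]], [10, 20, 30])
# [5.5, 11.0, 16.5, 2.0]
fused_by_columns = fuse([[1, 2], [3, 4]], [10, 20, 30], by="columns")
# [5.5, 11.5, 16.0, 2.0]
```

Setting both weights to 1.0 gives the plain accumulation variant mentioned in the text.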

Each decision tree has a targeted classification capability. The specified number of feature values is chosen according to the different classes, so the decision trees classify the same feature vector matrix from different angles, which integrates their different classification capabilities. The resulting classification performance is higher than that of a single classifier.
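To make the n-round extraction, column sampling, and voting concrete, here is a small sketch in which one-level decision stumps stand in for full decision trees. Sampling rows with replacement, the stump learner, the seed, and the toy data are assumptions made to keep the example short.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_indices):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def train_forest(rows, labels, n_trees, n_features, seed=0):
    """n rounds of row sampling, each followed by random column sampling."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]          # one training set
        columns = rng.sample(range(len(rows[0])), n_features)     # column sampling
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], columns))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


rows = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0],
        [10, 11, 10, 11], [11, 10, 11, 10], [10, 10, 11, 11], [11, 11, 10, 10]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels, n_trees=15, n_features=2)
```

Because each stump sees a different sample of rows and a different pair of columns, the trees split on different features, which is the "different angles" effect described above.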

The average generalization error of a decision tree in a random forest is related to the regression function.

In some preferred embodiments, the voting approach involves weighted accumulation of the output results of each decision tree.
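Weighted accumulation of the tree outputs can be sketched as follows. The text does not say where the weights come from; per-tree validation accuracy would be one plausible choice, and the labels "block" and "allow" are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Sum the weight behind each predicted label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Two trees say "allow", but the single "block" vote carries more total weight.
decision = weighted_vote(["block", "allow", "allow"], [0.9, 0.3, 0.4])
```

With equal weights the function reduces to an ordinary majority vote.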

Fig. 2 is a schematic diagram of an AI-based internet content detection system according to the present application, where the system includes:

The application provides an AI-based internet content detection system, which comprises a processor and a memory:

the processor is configured to perform the method according to any of the embodiments of the first aspect according to instructions in the program code.

The present application provides a computer readable storage medium for storing program code for performing the method of any one of the embodiments of the first aspect.

In a specific implementation, the present invention also provides a computer storage medium that may store a program which, when executed, may perform some or all of the steps in the various embodiments of the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented in software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments, or in some parts of the embodiments, of the present invention.

For the same or similar parts, the various embodiments in this description may be referred to one another. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

The embodiments of the present invention described above do not limit the scope of the present invention.

Claims

1. An AI-based internet content detection method, the method comprising:

and managing the collected data packet according to the classification result.

2. The method of claim 1, wherein the recognition model is trained by minimizing the entropy loss function through backpropagation so as to avoid oversaturation, and training of the recognition model is complete when the accuracy of the recognition model meets a threshold.

3. The method of claim 1, wherein the fusing the second image features and the sequence features comprises extracting the second image features one by one in units of rows or columns and writing the second image features into a matrix with a single dimension, extracting the sequence features one by one and writing the sequence features into a matrix with another single dimension, and weighting or accumulating the matrices with the two single dimensions to obtain a fused feature matrix.

4. The method of claim 2 or 3, wherein voting comprises weighted accumulation of the output of each decision tree.

5. An AI-based internet content detection system, the system comprising:

6. An AI-based internet content detection system, the system comprising a processor and a memory:

the processor is configured to perform the method according to any of the claims 1-4 according to instructions in the program code.

7. A computer readable storage medium, characterized in that the computer readable storage medium is for storing program code for performing the method of any of claims 1-4.