CN110909531B

CN110909531B - Information security screening method, device, equipment and storage medium

Info

Publication number: CN110909531B
Application number: CN201910991165.2A
Authority: CN
Inventors: 杨冬艳; 王智浩
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2024-03-22
Anticipated expiration: 2039-10-18
Also published as: CN110909531A

Abstract

The invention relates to the technical field of artificial intelligence and discloses an information security screening method. A crawler system is built on a distributed system framework and an in-memory computing engine to collect information from different channels. Terms related to information security are then learned continuously using state-of-the-art machine learning and semantic definition algorithms from various industries, and the data sources of the acquired texts are continuously expanded, so that network security information is analyzed across a broader field and at a deeper level. Internal association relations between the data are constructed, which increases the validity and persuasiveness of the analysis results, and information transmitted in a network is screened based on the internal association relations in the knowledge base of learned terms. The invention also provides an information security screening device, equipment and a computer-readable storage medium, which mine the internal connections of network information security knowledge, assist in identifying fraud or vulnerability scenarios in network services, and improve the security of information transmitted over the network.

Description

Information security screening method, device, equipment and storage medium

Technical Field

The present invention relates to the field of network security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for information security screening.

Background

With the continuous development of network technology, networks have become a part of everyday life, and users now fulfil a wide range of needs online. In doing so they must provide private information such as identity cards and bank information. Because the network is a shared platform for transactions and communication, such information can leak if it is not well protected, and serious consequences can follow if it falls into the hands of lawbreakers. Network information security has therefore become a major focus of current network communication, especially the monitoring of and protection against network attacks. Unlike threats in the traditional security field, network threats take changeable forms and are difficult to detect.

In the prior art, security recognition of network information is mainly performed through self-learning with a single model algorithm. Conventional methods such as rule engines, data mining and machine discovery have difficulty recognizing the internal connections between threats, especially since network attacks are now often carried out by pushing or altering phrase information. When a single model is used to recognize these internal relations, it captures them poorly, so misjudgments occur frequently, which poses a considerable security risk to users.

Disclosure of Invention

The main purpose of the invention is to provide an information security screening method, device, equipment and storage medium that solve the technical problem of the low accuracy with which existing network information security recognition approaches identify security risks.

In order to achieve the above object, the present invention provides an information security screening method, which includes the following steps:

acquiring data information related to network security on each Internet channel through a crawler platform, wherein the crawler platform is built on a distributed system framework and an in-memory computing engine, and the data information at least comprises text data and image data;

according to a preset machine learning algorithm and a semantic definition algorithm of the entry, performing machine learning of text semantics or image shape outlines on the data information to obtain a machine learning result;

converting the machine learning result into a feature matrix of a word vector, and establishing an internal association relation between different data information in the data information based on the feature matrix to obtain an information security recognition library, wherein the internal association relation comprises a semantic association relation between text data and a shape contour association relation between image data;

acquiring a security event to be processed, and determining the data type of the security event, wherein the security event is network information received by a network terminal from a network server through a network;

and selecting a corresponding knowledge base from the information security recognition base according to the data type, and carrying out security classification and screening processing on the security event based on semantic association relations between text data and shape contour association relations between image data in the knowledge base.
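The last two steps (determining the data type, then screening against the matching knowledge base) can be illustrated with a minimal Python sketch. The class `RecognitionLibrary`, the function `screen_event`, the labels "suspicious" and "normal", and the sample entries are all hypothetical names chosen for illustration; the patent does not specify a data layout or a classification rule, and a simple set overlap stands in here for the association-based comparison.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionLibrary:
    """Hypothetical container: one knowledge base per data type."""
    text_kb: set = field(default_factory=set)   # risky entries learned from text
    image_kb: set = field(default_factory=set)  # labels of risky shape contours

    def select(self, data_type):
        # Pick the knowledge base that matches the event's data type.
        if data_type == "text":
            return self.text_kb
        if data_type == "image":
            return self.image_kb
        raise ValueError(f"unsupported data type: {data_type}")


def screen_event(features, data_type, library):
    """Classify an event by the features it shares with the knowledge base."""
    knowledge_base = library.select(data_type)
    hits = sorted(set(features) & knowledge_base)
    label = "suspicious" if hits else "normal"  # assumed two-class rule
    return label, hits


library = RecognitionLibrary(
    text_kb={"phishing", "trojan"},
    image_kb={"fake_login_icon"},
)
print(screen_event(["click", "phishing", "link"], "text", library))
```

A real system would replace the set overlap with a comparison based on the learned semantic or contour associations; the sketch only shows the dispatch by data type.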

Optionally, the step of obtaining the data information related to network security on each internet channel through the crawler platform includes:

acquiring interaction data acquired when the crawler platform monitors the Internet channel in real time;

extracting basic data related to network security from the monitored interaction data according to a rule of randomly extracting samples, and forming data samples for training the information security recognition library based on the basic data, wherein the Internet channel comprises at least one of Internet web pages and a data storage platform;

if the extracted basic data is text data, dividing the text data into a plurality of terms according to a semantic recognition technology to form the data information, wherein the terms are unit terms with definite semantics;

if the extracted basic data is image data, segmenting the image data into a plurality of maps according to a segmentation technology based on the minimum unit of image shape to form the data information, wherein each map is an image fragment with a single shape and a complete, determinate outline.

Optionally, the dividing the text data into a plurality of terms according to the semantic recognition technology, and forming the data information includes:

dividing the text data according to a forward segmentation method and a reverse segmentation method respectively to obtain a forward vocabulary entry set and a reverse vocabulary entry set;

calculating absolute frequency and relative frequency of each term in the forward term set and the reverse term set;

comparing the absolute frequency of the forward vocabulary entry set with the absolute frequency of the reverse vocabulary entry set, and comparing the relative frequency of the forward vocabulary entry set with the relative frequency of the reverse vocabulary entry set to obtain a comparison result of the absolute frequency and the relative frequency;

calculating the difference between the absolute frequency and the relative frequency in the comparison result, and selecting any entry set whose difference is within a preset range as the segmentation set of the text data;

judging whether the absolute frequency and the relative frequency of each entry in the selected entry set are larger than corresponding preset statistical values;

if an entry is smaller than the preset statistical value, removing that entry from the entry set to form the final data information;

the absolute frequency is calculated by the following steps: dividing the number of times of occurrence of the term by the length of the text data to obtain the absolute frequency of the term.
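As a rough illustration of the segmentation and frequency steps, the following Python sketch segments a string in both directions and computes the absolute frequency as defined above (occurrences divided by text length). It assumes dictionary-based maximum matching as the forward and reverse segmentation methods, which the text does not specify; the vocabulary, the sample string and the threshold are hypothetical. The relative frequency is not defined in the text, so it is left out.

```python
from collections import Counter


def forward_max_match(text, vocab, max_len=5):
    """Scan left to right, taking the longest dictionary word each time."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len=5):
    """Scan right to left, taking the longest dictionary word each time."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def absolute_frequency(tokens, text):
    """Occurrences of each entry divided by the length of the text."""
    return {t: c / len(text) for t, c in Counter(tokens).items()}


def drop_rare(freqs, threshold):
    """Remove entries whose frequency is below the preset statistical value."""
    return {t: f for t, f in freqs.items() if f >= threshold}


# Hypothetical vocabulary and input; the two directions disagree on purpose.
vocab = {"the", "theta", "table", "bled", "own", "down", "there"}
text = "thetabledownthere"
forward = forward_max_match(text, vocab)
backward = backward_max_match(text, vocab)
print(forward, backward)
print(absolute_frequency(backward, text))
```

The two directions produce different entry sets for the same input, which is the situation in which the frequency comparison is needed to choose between them.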

Optionally, the machine learning algorithm includes a language learning model and a regression training model, and in the step of dividing the text data into a plurality of terms according to the semantic recognition technology to form the data information, the method further includes:

acquiring semantic definition rules of new entries and proper nouns in the Internet;

after the step of converting the machine learning result into a feature matrix of a word vector and establishing an internal association relationship between different data information in the data information based on the feature matrix to obtain an information security recognition library, the method further comprises the following steps:

re-segmenting and learning the text data according to the semantic definition rules and the language learning model to form a text knowledge base;

and carrying out regression analysis on the information security knowledge base according to the regression training model and the text knowledge base to obtain new vocabulary entries and proper nouns meeting regression conditions in the text knowledge base, and adding the new vocabulary entries and proper nouns into the information security knowledge base.

Optionally, the re-segmenting and learning of the text data according to the semantic definition rule and the language learning model to form a text knowledge base includes:

if the language learning model is a TF-IDF model, re-segmenting the text data according to the semantic definition rule to obtain a new entry set;

according to the characteristic evaluation method of the TF-IDF model, evaluating each term in the new term set;

and adjusting the new entry set according to the evaluation result to form the text knowledge base.

Optionally, the feature evaluation method of the TF-IDF model includes:

calculating characteristic frequency TF and inverse document frequency IDF of each term in the new term set in the text data;

and determining the evaluation grade P of each entry according to the characteristic frequency and the inverse document frequency.
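The text does not give the formula that maps TF and IDF to the evaluation grade P. A minimal Python sketch, assuming the common product form P = TF × IDF with TF as the share of an entry in one document and IDF = log(N / df), could look as follows; the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {entry: score} dict per document.

    Assumed scoring: TF = count / document length,
    IDF = log(number of documents / documents containing the entry).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each entry once per document

    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            entry: (count / len(doc)) * math.log(n_docs / doc_freq[entry])
            for entry, count in counts.items()
        })
    return scores


# Hypothetical segmented documents.
docs = [["phishing", "link", "link"], ["link", "update"]]
scores = tf_idf(docs)
print(scores[0])
```

Under this assumption an entry that appears in every document scores zero, so entries that are specific to a few documents rank highest when the new entry set is adjusted.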

Optionally, the converting the learned result into a feature matrix of the word vector, and establishing an internal association relationship between different data information in the data information based on the feature matrix includes:

training each term in the text data after learning according to a machine learning model in the machine learning algorithm to obtain a term characteristic;

vector training is carried out on the word characteristics and the corresponding semantics thereof through word2vec, so as to generate word vectors of the word characteristics;

expanding multi-dimensional semantics of the word features, and carrying out vector training on the dimensions according to the training mode of the word vectors to generate corresponding dimension vectors;

calculating a position vector of the word feature in the entry according to the word vector and a dimension vector corresponding to the word vector;

and constructing a three-dimensional spatial position relation diagram of the word characteristics according to the position vector, wherein the three-dimensional spatial position relation diagram comprises internal association relations among entries in the text data.

In addition, in order to achieve the above object, the present invention further provides an information security screening apparatus, including:

the data acquisition module is used for acquiring data information related to network security on each Internet channel through a crawler platform, wherein the crawler platform is built on a distributed system framework and an in-memory computing engine, and the data information at least comprises text data and image data;

a processing module configured to:

according to a preset machine learning algorithm and a semantic definition algorithm of an entry, learning text semantics or image shape outlines of the data information, converting a learning result into a feature matrix of a word vector, and establishing internal association relations between different data information in the data information based on the feature matrix to obtain an information security recognition library, wherein the internal association relations comprise semantic association relations between text data and shape outline association relations between image data;

and selecting a corresponding knowledge base from the information security recognition base according to the data type, and carrying out security classification and screening processing on the security event based on semantic association relations between text data and shape outline association relations between image data in the knowledge base.

In another embodiment of the invention, the data acquisition module comprises a monitoring unit and a capturing unit, wherein:

the monitoring unit is used for setting the crawler platform to continuously monitor the interaction data of the Internet channel in real time, extracting basic data related to network security from the monitored interaction data according to a rule of randomly extracting samples, and forming data samples for training the information security recognition library based on the basic data, wherein the Internet channel comprises at least one of Internet web pages and a data storage platform;

the capturing unit is used for, when the extracted basic data is text data, dividing the text data into a plurality of entries according to a semantic recognition technology to form the data information, wherein the entries are unit words and sentences with definite semantics; and, when the extracted basic data is image data, dividing the image data into a plurality of maps according to a segmentation technology based on the minimum unit of image shape to form the data information, wherein each map is an image fragment with a single shape and a complete, determinate outline.

In another embodiment of the present invention, the capturing unit is configured to: divide the text data according to a forward segmentation method and a reverse segmentation method, respectively, to obtain a forward entry set and a reverse entry set; calculate the absolute frequency and relative frequency of each entry in the forward entry set and the reverse entry set; compare the absolute frequency of the forward entry set with that of the reverse entry set, and the relative frequency of the forward entry set with that of the reverse entry set, to obtain a comparison result of the absolute frequency and the relative frequency; calculate the difference between the absolute frequency and the relative frequency in the comparison result, and select any entry set whose difference is within a preset range as the segmentation set of the text data; judge whether the absolute frequency and the relative frequency of each entry in the selected entry set are larger than corresponding preset statistical values; and, if an entry is smaller than the preset statistical value, remove that entry from the entry set to form the final data information. The absolute frequency is calculated by dividing the number of occurrences of the entry by the length of the text data.

In another embodiment of the present invention, the capturing unit is further configured to obtain a semantic definition rule for new terms and proper nouns in the internet;

the processing module is also used for re-segmenting and learning the text data according to the semantic definition rules and the language learning model to form a text knowledge base; and carrying out regression analysis on the information security knowledge base according to the regression training model and the text knowledge base to obtain new entries and proper nouns in the text knowledge base that meet the regression conditions, and adding the new entries and proper nouns to the information security knowledge base.

In another embodiment of the present invention, if the language learning model is a TF-IDF model, the processing module is configured to re-segment the text data according to the semantic definition rule to obtain a new entry set; evaluate each entry in the new entry set according to the feature evaluation method of the TF-IDF model; and adjust the new entry set according to the evaluation result to form the text knowledge base.

In another embodiment of the present invention, the feature evaluation method of the TF-IDF model includes:

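calculating the characteristic frequency TF and the inverse document frequency IDF of each entry in the new entry set in the text data;

and determining the evaluation level P of each entry according to the characteristic frequency and the inverse document frequency.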

In another embodiment of the present invention, the processing module is configured to perform word segmentation training on each term in the text data after learning according to a machine learning model in the machine learning algorithm, so as to obtain a word feature; vector training is carried out on the word characteristics and the corresponding semantics thereof through word2vec, so as to generate word vectors of the word characteristics; expanding multi-dimensional semantics of the word features, and carrying out vector training on the dimensions according to the training mode of the word vectors to generate corresponding dimension vectors; calculating a position vector of the word feature in the entry according to the word vector and a dimension vector corresponding to the word vector; and constructing a three-dimensional spatial position relation diagram of the word characteristics according to the position vector, wherein the three-dimensional spatial position relation diagram comprises internal association relations among entries in the text data.

In addition, in order to achieve the above object, the present invention also provides information security screening equipment, including: a memory, a processor, and an information security screening program that is stored in the memory and can run on the processor, wherein the information security screening program, when executed by the processor, implements the steps of the information security screening method according to any one of the above.

In order to achieve the above object, the present invention provides a computer-readable storage medium having stored thereon an information security screening program which, when executed by a processor, implements the steps of the information security screening method according to any one of the above.

The information security screening method provided by the invention builds a crawler system on the distributed system architecture Hadoop and the in-memory computing engine Spark, and collects information from different channels through the crawler system. Entries related to information security are then learned continuously using state-of-the-art machine learning and deep learning technologies from various industries, and the data sources of the acquired texts are continuously expanded, so that network security information is analyzed across a broader field and from a deeper perspective, which increases the validity and persuasiveness of the analysis results. Information transmitted in a network is screened based on the knowledge base of learned entries, so that fraud or vulnerability scenarios in network services are identified and the security of information transmitted over the network is improved.

Drawings

FIG. 1 is a schematic structural diagram of an operating environment of a mobile terminal according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of the information security screening method according to the present invention;

FIG. 3 is a schematic structural diagram of a natural semantic processing platform provided by the invention;

FIG. 4 is a schematic diagram of the functional modules of the information security screening apparatus provided by the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The invention provides an information security screening device, which may be a plug-in in a mobile terminal and is used for executing the information security screening method provided by the embodiments of the invention. FIG. 1 is a schematic structural diagram of the operating environment of a mobile terminal that is involved in the embodiments of the invention and can implement information security screening.

As shown in FIG. 1, the mobile terminal includes a processor 101 (e.g. a CPU), a communication bus 102, a user interface 103, a network interface 104 and a memory 105. The communication bus 102 enables communication between these components. The user interface 103 may comprise a display screen and an input unit such as a keyboard. The network interface 104 may optionally comprise a standard wired interface and a wireless interface (e.g. a Wi-Fi interface). The memory 105 may be a high-speed RAM or a non-volatile memory, such as a disk memory. The memory 105 may alternatively be a storage system separate from the aforementioned processor 101.

It will be appreciated by those skilled in the art that the hardware configuration of the mobile terminal shown in fig. 1 does not constitute a limitation on the information security screening apparatus or device provided by the present invention, and may include more or fewer components than shown, or may combine certain components, or may have a different arrangement of components.

As shown in FIG. 1, the memory 105, as a type of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and an information security screening program. The operating system is a program that manages and controls the hardware and software resources of the information security screening device and supports the running of the information security screening program and other software and/or programs.

In the hardware architecture of the mobile terminal shown in FIG. 1, the network interface 104 is mainly used for accessing the network; the user interface 103 is mainly used for receiving the face portrait data to be recognized and parameters such as the requirements for recognizing a face; and the processor 101 may be used to call the information security screening program stored in the memory 105 and perform the operations of the following embodiments of the information security screening method.

Based on the hardware structure of the mobile terminal described above, the invention provides an information security screening method that is mainly applied to small terminal equipment, such as mobile phones, iPads and other mobile devices. Referring to FIG. 2, FIG. 2 is a flow chart of the information security screening method provided by an embodiment of the invention. In this embodiment, the information security screening method specifically includes the following steps:

Step S210, acquiring data information related to network security on each Internet channel through a crawler platform;

In this embodiment, the acquired data information includes at least one of text data and image data. The crawler platform is implemented on the distributed system framework Hadoop and the in-memory computing engine Spark and comprises a base layer, an algorithm layer, a capability layer and a functional layer. The base layer provides the deep learning technologies that the platform can implement, the algorithm layer holds the algorithms for processing the acquired data, and the functions and capabilities defined in the capability layer and the functional layer are applied to the acquired data through those algorithms.

That is, the bottom layer of the crawler system relies on Hadoop's HDFS (Hadoop Distributed File System) to store text data on network information security and processes the data with the Spark in-memory computing engine. The crawler system can provide communication interfaces for various network devices and network software. Through these interfaces the system can acquire data information related to network security from different devices, software and web pages, and can monitor and acquire such data continuously, 24 hours a day, 7 days a week. The interfaces specifically cover search engines, news portals, forums, blogs, microblogs, electronic newspapers and the like.

Here the data information includes at least one of text and image data and is read from devices and the network through the communication interfaces of the crawler system. The information is text or image information considered to pose a network security risk. Image information may be, for example, icon information representing a certain action. Text information refers to sentences and even to small pieces of code, programming languages and the like, because attacks on a network are often carried out through programming languages or by inserting certain keywords into text information.

Step S220, performing machine learning of text semantics or image shape outlines on the data information according to a preset machine learning algorithm and a semantic definition algorithm of the vocabulary entry to obtain a machine learning result;

step S230, converting the machine learning result into a feature matrix of a word vector, and establishing an internal association relationship between different data information in the data information based on the feature matrix to obtain an information security recognition library;

in this step, the internal association relationship includes a semantic association relationship between text data and a shape contour association relationship between image data, and in practical application, the internal association relationship may also be an association relationship between text data and image data.

In this embodiment, converting the learning result into a feature matrix of word vectors, and establishing the internal association relations between different data information based on the feature matrix, proceeds as follows.

In practical application, the semantic association relations learned from the text data can be established by constructing a feature matrix of word vectors:

first, word segmentation training is performed on the text data, that is, the text data is divided into phrases in several ways, and a uniquely identifying word vector is generated for each phrase obtained;

then, based on the word vectors, the phrases obtained from the text data are expanded in multiple dimensions, a dimension vector is generated for each dimension, and word meanings are added based on the dimension vectors;

finally, the position vector of each phrase within the entries of the text data is calculated from the dimension vectors and the word vectors, and the word vector and the dimension vectors are combined to obtain, for each phrase, a semantic relation vector with respect to each entry.

In practical applications, in addition to learning the position of a phrase in the entries specified in the text data, the position vector must also predict the position of each phrase in other entries according to the word senses expanded for that phrase.

Specifically, the multi-dimensional expansion is performed mainly according to the semantics of the phrase. For each word at each position in the segmented text data, direction and distance can be disregarded, so each word in the sentence can be decoded directly. A corresponding weight matrix can represent the relation between each word and the other words in the same sentence: the larger the weight, the stronger the relation, and words with ambiguous meanings generally carry larger weights.
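One way to picture such a weight matrix is the Python sketch below. It assumes the weights are a row-wise softmax over dot products of the word vectors, which is a common attention-style choice and not something the text prescribes; the function name `weight_matrix` and the toy vectors are hypothetical.

```python
import math


def weight_matrix(vectors):
    """Row i holds the weights of word i toward every word in the sentence.

    Assumed scoring: dot product followed by a softmax, so each row sums
    to 1 and a larger weight means a stronger relation.
    """
    matrix = []
    for v in vectors:
        scores = [sum(a * b for a, b in zip(v, u)) for u in vectors]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


# Toy two-dimensional word vectors: the first two words are similar.
sentence_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = weight_matrix(sentence_vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Because every pair of words is scored directly, the distance between two words in the sentence plays no role in the weight, which matches the remark that direction and distance can be disregarded.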

Further, when representing the text data, a position vector for each word is introduced to carry sequence order information. For example, the position vector makes it possible to tell that "you owe me 1,000,000, to be repaid tomorrow" means something different from "I owe you 1,000,000, to be repaid tomorrow", because each representation includes the vector of the position of each word.

Finally, the vectors are combined into a final vector, and the association relations between the data are interpreted by combining association rule mining with graph mining.
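The role of the position vector can be shown with a small Python sketch. It assumes a sinusoidal position encoding that is added to each word vector; the text does not say how position vectors are computed or combined, and the function names, the toy word vectors and the example tokens are hypothetical.

```python
import math


def position_vector(pos, dim):
    """Sinusoidal encoding of a position (an assumed choice)."""
    vec = []
    for k in range(dim):
        angle = pos / (10000 ** (2 * (k // 2) / dim))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def plain(tokens, word_vectors):
    """Word vectors only: the order of the tokens is not recorded."""
    return [list(word_vectors[t]) for t in tokens]


def encode(tokens, word_vectors):
    """Word vector plus position vector for each token."""
    dim = len(next(iter(word_vectors.values())))
    return [
        [w + p for w, p in zip(word_vectors[t], position_vector(i, dim))]
        for i, t in enumerate(tokens)
    ]


# Toy word vectors for a three-word vocabulary.
word_vectors = {
    "you": [1.0, 0.0, 0.0, 0.0],
    "owe": [0.0, 1.0, 0.0, 0.0],
    "me": [0.0, 0.0, 1.0, 0.0],
}
first = ["you", "owe", "me"]
second = ["me", "owe", "you"]

# As unordered collections, the plain vectors of both sentences coincide.
print(sorted(plain(first, word_vectors)) == sorted(plain(second, word_vectors)))
# With position vectors added, the two collections differ.
print(sorted(encode(first, word_vectors)) == sorted(encode(second, word_vectors)))
```

Without the position vectors an order-insensitive model sees the same set of vectors for both sentences; once positions are added, the two sentences become distinguishable.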

Furthermore, text data contains at least one entry or keyword. For text data with only one entry or keyword, the semantics of the text data are learned through a security learning algorithm. For text data containing two or more entries, in addition to learning the semantics of each entry, the semantic relations between the entries must also be established; that is, after the semantics of the entries are learned, the internal association relations between the entries are established according to the semantics of each entry. This can also be understood as a simple classification of the entries to form a corresponding security knowledge base. The security knowledge base can be understood as a reference knowledge base for the security events to be screened: the security of a security event is determined by comparing it against the knowledge base, which may serve as a screening feature set for calculating the degree of security of the event.

In this embodiment, for the text data in the basic learning knowledge base, semantic learning is performed on the text data according to a machine learning model in the machine learning algorithm combined with the semantic definition method for entries, and the semantic association relations among the entries in the basic learning knowledge base are established according to the learning result to form the security text knowledge base.

For the image data in the basic learning knowledge base, memory learning is performed on the shapes and outlines of the image data according to a machine learning model in the machine learning algorithm, and internal relations between image data of similar or identical types are established to form an image knowledge base.

In this embodiment, the security learning algorithm can be understood as any algorithm or model for learning network security recognition. The learning is based mainly on natural semantics, and the meaning of an entry can be determined from its semantic definition. For example, offensive entries are generally used for attacks, so such entries are used for learning and training in order to obtain ways of recognizing them.

For image data, in addition to memory learning of shapes and outlines, basic features are obtained by methods such as image flipping, structure transformation, color histogram extraction, extraction of high-level semantic information, and clustering of low-level visual features. On this basis, frequent pattern mining is performed separately on the low-level image features and the high-level semantic features, and the two parts are fused into multi-modal association rules that interpret the association relations of the image data.

In this scheme, the knowledge base is built with a machine learning model combined with the semantic definitions of entries, so that many new entries can be learned. New words are general words and professional terms of various industries that emerge as society and technology develop. Proper nouns, such as person names, place names, organization names, commodity names, foreign-language brand translations, and dialect idioms, are a special processing unit: each has the attributes of a word, is an indivisible whole, and generally follows its own rules.

The number of new words is hard to quantify. As society progresses, and especially in fast-developing fields such as computing, biotechnology, and information technology, new professional terms keep appearing. For example, with the development of networks and games, the word "web game", which did not exist before, is now in common use; it is often encountered in text processing and must be recognized as a single word when text is segmented. Distinguishing meaningful new words from huge, unordered information is one of the important tasks of contemporary information work. Because new words are often an important part of what information security must monitor, the system builds on a mechanical word segmentation method combined with a semantic definition method, and applies statistical knowledge to the segmented text during word segmentation in order to identify new words.

Step S240, a security event to be processed is obtained, and the data type of the security event is determined;

in this step, the security event is network information received by the network terminal from the network server through the network.

Step S250, selecting a corresponding knowledge base from the information security recognition base according to the data type, and performing security classification and screening processing on the security event based on the semantic association relationship between text data and the shape contour association relationship between image data in the knowledge base.

In this embodiment, the security classification and screening herein specifically refers to the determination of the security level of information, and the mining and classification of data text.
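As a non-limiting illustration, the selection of a knowledge base by data type in steps S240 and S250 may be sketched as follows. The knowledge base contents, the risk levels, and the names `TEXT_KB`, `IMAGE_KB` and `screen_event` are hypothetical and are introduced only for this sketch; a real knowledge base would hold learned association relations rather than a fixed lookup table.

```python
# Hypothetical ordering of risk levels used to pick the most severe match.
RISK_ORDER = {"safe": 0, "medium": 1, "high": 2}

# Toy stand-ins for the text knowledge base and the image knowledge base.
TEXT_KB = {"phishing": "high", "password reset": "medium"}
IMAGE_KB = {"fake-login-form": "high"}


def screen_text(payload):
    """Return the most severe risk level of any known entry found in the text."""
    lowered = payload.lower()
    levels = [level for entry, level in TEXT_KB.items() if entry in lowered]
    return max(levels, key=RISK_ORDER.__getitem__, default="safe")


def screen_image(contour_labels):
    """Return the most severe risk level among the recognized contour labels."""
    levels = [IMAGE_KB[label] for label in contour_labels if label in IMAGE_KB]
    return max(levels, key=RISK_ORDER.__getitem__, default="safe")


# Step S250: the data type of the event selects the screening routine.
SCREENERS = {"text": screen_text, "image": screen_image}


def screen_event(event):
    """Dispatch a security event to the knowledge base matching its data type."""
    data_type = event["type"]
    if data_type not in SCREENERS:
        raise ValueError(f"unsupported data type: {data_type}")
    return SCREENERS[data_type](event["payload"])
```

Keeping the dispatch table separate from the screening routines means that a further data type could be supported by registering one more routine, without changing `screen_event`.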

In this embodiment, for step S230, building the internal association graph includes building the internal information of each entry across real security events. In practical application, data information related to network security is acquired by collecting security events; screened security events are cached in web pages or in the security logs of some devices, and they may be dangerous events or safe events. After the events are acquired, key entries are extracted from them by an entry extraction method, learning is performed in combination with the semantic definitions of the entries, and the entries extracted from the events are linked according to their semantic definitions and the security types of the events, so that the internal connections of the events are learned.

In this embodiment, after step S230, the method further includes establishing an association relationship between the secure text knowledge base and the image knowledge base. This step mainly matches text data to marked images: by learning phrases such as "as shown in the following figure" in the text data, it determines whether the currently acquired security event is a concurrent text-and-image event, and if so, the following screening method is executed:

First, once an event containing both text and image has been identified, the image is acquired and its outline is recognized. Specifically, the entry interpretation of the image is looked up through outline matching against the image knowledge base; this interpretation is determined by the association relationship, trained and established in advance, between the security knowledge base and the image knowledge base. The text in the security event is then compared against the entry interpretation. If the text contains an entry that corresponds to or is close to the entry interpretation, the security event is considered safe; otherwise a certain risk exists and security management and control processing is required.
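The screening of a concurrent text-and-image event described above may be sketched, under simplifying assumptions, as follows. The mapping `CONTOUR_TO_ENTRIES`, the contour labels and the function names are hypothetical; outline matching itself is assumed to have already produced a contour label, and only exact entry matches are checked, so the "close to" case is not covered.

```python
import re

# Cue phrase indicating that the text refers to an accompanying image.
FIGURE_CUE = re.compile(r"as shown in the (following )?figure", re.IGNORECASE)

# Hypothetical association between image contours and their entry interpretations.
CONTOUR_TO_ENTRIES = {
    "bank-logo": {"bank", "account"},
    "padlock": {"encryption", "certificate"},
}


def is_text_image_event(text):
    """Return True when the text contains the figure cue phrase."""
    return FIGURE_CUE.search(text) is not None


def screen_concurrent_event(text, contour):
    """Compare the text of the event with the entry interpretation of its image."""
    if not is_text_image_event(text):
        return "not-concurrent"
    expected = CONTOUR_TO_ENTRIES.get(contour, set())
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Safe only if the text mentions an entry associated with the image contour.
    return "safe" if expected & words else "needs-control"
```

An unknown contour yields an empty entry interpretation, so such an event is routed to security management and control rather than being treated as safe.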

In this embodiment, when data information is acquired through a crawler platform, specifically, interactive data of the internet channel is monitored in real time without interruption by setting the crawler platform, basic data related to network security is extracted from the monitored interactive data according to a rule of randomly extracting samples, and a data sample for training the information security identification library is formed based on the basic data, wherein the internet channel comprises at least one of an internet webpage and a data storage platform;

If the extracted basic data is text data, dividing the text data into a plurality of entries according to a semantic recognition technology to form the data information, wherein the entries are unit terms with definite semantics, that is, the smallest units of the text language that can be used independently, carry meaning, and have a relatively fixed meaning within a paragraph or sentence;

if the extracted basic data are image data, dividing the image data into a plurality of maps according to a dividing technology of an image shape minimum unit to form the data information, wherein the maps are image fragments with complete outlines of a single shape.

In practical application, a large amount of data information is acquired from different channels when the basic data is collected, and the acquired data information is filtered according to initialization semantics. The filtering includes signature authentication of the configuration files in the data information, that is, checking whether they are legitimate data or data that has undergone security processing. After verification passes, the related lexicon and the sample library are read into the engine memory of the crawler system and stored, forming the basic data.
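The description does not specify the signature scheme used for authenticating configuration files. Purely as an assumption for illustration, the following sketch uses an HMAC-SHA256 tag computed with a shared key; the record layout (`config`, `signature`) and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(config_bytes, key):
    """Compute an HMAC-SHA256 tag over the configuration bytes (assumed scheme)."""
    return hmac.new(key, config_bytes, hashlib.sha256).hexdigest()


def verify_config(config_bytes, signature, key):
    """Check the tag in constant time to avoid leaking timing information."""
    return hmac.compare_digest(sign(config_bytes, key), signature)


def load_if_valid(records, key):
    """Keep only the records whose configuration passes signature verification."""
    return [r for r in records if verify_config(r["config"], r["signature"], key)]
```

Only the records returned by `load_if_valid` would then be read into memory as basic data; records that fail verification are dropped before any further processing.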

In this embodiment, the dividing the text data into a plurality of entries according to the semantic recognition technology, and forming the data information includes:

comparing the absolute frequency of the forward term set with the absolute frequency of the reverse term set, and comparing the relative frequency of the forward term set with the relative frequency of the reverse term set, to obtain a comparison result of the absolute frequency and the relative frequency; calculating the difference between the absolute frequency and the relative frequency in the comparison result; and selecting any term set whose difference lies within a preset range as the segmentation set of the text data;

In practical application, two aspects must be considered when segmenting entries: the accuracy of word segmentation and its speed. Whatever the word segmentation method, considerable time is needed to calculate how likely the character string to be segmented is to form words; the entries are then segmented according to statistical or grammatical rules to obtain the most probable correct result, which improves segmentation accuracy. Therefore, if the initial segmentation can be made faster, the speed of the whole segmentation algorithm is greatly improved.

First, the same text is segmented by the forward maximum matching method and by the reverse maximum matching method, and the results are compared. When "Changchun festival making words" is segmented, for example, the result of the reverse maximum matching method is selected because the forward maximum matching method leaves one word unmatched. Next, word frequency as described above is taken into account: each word receives a word frequency value based on its probability of occurrence in Chinese. When "Changchun drug store" is segmented by the two methods, the word "spring drug store" produced by the reverse maximum matching method has a much lower word frequency than the other words. That segmentation is therefore not a common one, and the result of the forward maximum matching method is used instead. Combining the forward and reverse maximum matching methods in this way greatly improves segmentation accuracy, and pairing them with a word frequency library effectively resolves segmentation ambiguity, further ensuring accuracy. After the entries are obtained by the forward and reverse segmentation methods, their ambiguity is judged through semantics, and segmentations with high ambiguity or low frequency of occurrence are removed, yielding the final, accurate entries.
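As a non-limiting illustration, forward and reverse maximum matching with a word frequency tie-break may be sketched as follows. The lexicon, the frequency values and the example string are hypothetical and are not taken from the description. Choosing the candidate whose rarest word is the most frequent is one possible reading of the word frequency comparison above, not necessarily the rule used in the invention.

```python
def forward_max_match(text, lexicon, max_len):
    """Scan left to right, always taking the longest lexicon word available."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Scan right to left, always taking the longest lexicon word available."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def pick_by_frequency(candidates, freq):
    """Prefer the segmentation whose least frequent word is the most frequent."""
    return max(candidates, key=lambda toks: min(freq.get(t, 0.0) for t in toks))


# Toy lexicon with made-up frequencies: research, graduate student, life, fate, origin.
FREQ = {"研究": 0.02, "研究生": 0.005, "生命": 0.01, "命": 0.001, "起源": 0.004}
TEXT = "研究生命起源"

forward = forward_max_match(TEXT, FREQ, max_len=3)    # ['研究生', '命', '起源']
backward = backward_max_match(TEXT, FREQ, max_len=3)  # ['研究', '生命', '起源']
best = pick_by_frequency([forward, backward], FREQ)   # the backward result
```

Here the forward scan greedily takes "研究生" and is left with the rare single character "命", so the frequency check selects the reverse result, which mirrors the kind of ambiguity the paragraph above describes.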

In this embodiment, the choice between the forward term set and the reverse term set as the division set is made according to the comparison of absolute and relative frequency within each term set: the term set whose absolute frequency is greater than its relative frequency is selected as the final division set. If the absolute frequency is greater than the relative frequency in both the forward and the reverse term sets, the difference between the absolute frequency and the relative frequency is examined further and one set is selected on that basis; preferably, the term set with the smaller difference between absolute and relative frequency is selected as the division set.
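The selection rule above may be sketched as follows. The absolute frequency follows the definition in claim 1 (occurrences of the entry divided by the length of the text data). How relative frequency is obtained, and how per-entry values are aggregated into one value per term set, is not stated in this section, so the sketch assumes each term set already comes with one (absolute, relative) pair. Returning `None` when neither set qualifies is likewise an assumption.

```python
def absolute_frequency(entry, text):
    """Occurrences of the entry divided by the length of the text (claim 1)."""
    if not text:
        return 0.0
    return text.count(entry) / len(text)


def choose_division_set(forward, reverse):
    """Pick 'forward' or 'reverse' from two (absolute, relative) frequency pairs."""
    pairs = {"forward": forward, "reverse": reverse}
    # Only term sets whose absolute frequency exceeds the relative one qualify.
    qualifying = {name: a - r for name, (a, r) in pairs.items() if a > r}
    if not qualifying:
        return None  # assumption: the description does not cover this case
    # When both qualify, prefer the smaller absolute-minus-relative difference.
    return min(qualifying, key=qualifying.get)
```

For example, `choose_division_set((0.30, 0.10), (0.15, 0.10))` returns `"reverse"`, because both sets qualify and the reverse set has the smaller difference.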

In this embodiment, the machine learning algorithm includes a language learning model and a regression training model, and the step of dividing the text data into a plurality of terms according to the semantic recognition technology to form the data information further includes:

after the step of learning text semantics or image shape outlines of the data information according to a preset machine learning algorithm and a semantic definition algorithm of an entry, converting the learned result into a feature matrix of a word vector, and establishing an internal association relationship between different data information in the data information based on the feature matrix to obtain an information security recognition library, the method further comprises the following steps:

In this embodiment, the language learning model specifically includes a conditional random field model, a TF-IDF model, a hidden Markov model, word2vec and the like, as well as the BERT model, which is relatively new in the industry. Machine learning models are fused in, including a vector space model, a probabilistic graphical model, a decision tree model, a support vector machine model and the like, and deep learning models from the natural language field (which can of course also be understood as regression models), such as CNN, RNN, LSTM and xgboost, are added. Natural language processing is performed on network information texts to analyze the matching relations among keywords and the contextual and progressive relations, and the internal relations among network information security event entities are expressed as a knowledge graph through graph mining. In actual use, natural language processing methods play the main role and machine learning methods are combined with them to screen security events well, which improves the accuracy and recall of the system.

In this case, re-dividing and learning the text data according to the language learning model, in combination with the semantic definition rules of new entries and proper nouns, to form a text knowledge base includes:

re-segmenting the text data according to the semantic definition rules of the new vocabulary entry and the proper nouns to obtain a new vocabulary entry set;

evaluating each term in the new term set according to a characteristic evaluation method of the TF-IDF model;

In this embodiment, the feature evaluation method of the TF-IDF model includes:

according to the characteristic frequency and the inverse document frequency, determining the evaluation level P of each entry, wherein the calculation formula is as follows:

P = TF × IDF;

where DF(t), on which the inverse document frequency IDF depends, represents the number of texts containing the entry t.

In practice, the TF-IDF method described above is the most commonly used criterion for evaluating a feature: it scores a feature by its TF × IDF value. TF (feature frequency) is defined as the number of times a feature appears in a page. To account for document length, TF is redefined as:

TF(fi, pj)′ = TF(fi, pj) / max(TF(fv, pj)), v = 1, 2, ….

Each feature in the feature set that does not appear in the document would then have a TF value of 0. To avoid this, the TF definition is modified again to TF(fi, pj)″ = 0.5 + 0.5 × TF(fi, pj)′.

The TF value reflects the importance of a feature relative to a document, on the assumption that a feature that occurs more often is more important. However, some features appear in almost all documents and therefore have high TF values; for example, the word "computer" occurs very frequently in the text resources of a network educational resource management system. Such features obviously contribute little to classification and should be removed from the feature set. The concept of IDF (inverse document frequency) is therefore introduced, which is defined as:

The IDF value of a feature clearly decreases as its DF value increases.
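The TF normalization and smoothing above can be combined with an inverse document frequency to score entries, as sketched below. The IDF definition is not reproduced in this text; the sketch assumes the commonly used form IDF(t) = log(N / DF(t)), where N is the number of documents, which is consistent with the statement that IDF decreases as DF increases but is not confirmed by the description. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def smoothed_tf(tokens):
    """TF'' = 0.5 + 0.5 * TF / max(TF) for every entry present in one document."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    peak = max(counts.values())
    return {t: 0.5 + 0.5 * c / peak for t, c in counts.items()}


def inverse_document_frequency(docs):
    """Assumed form: IDF(t) = log(N / DF(t)), DF(t) = documents containing t."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / d) for t, d in df.items()}


def tf_idf(docs):
    """Return one {entry: P} mapping per document, with P = TF'' * IDF."""
    idf = inverse_document_frequency(docs)
    return [{t: tf * idf[t] for t, tf in smoothed_tf(doc).items()} for doc in docs]


DOCS = [
    ["computer", "virus", "virus"],
    ["computer", "firewall"],
    ["computer", "phishing", "phishing"],
]
SCORES = tf_idf(DOCS)
```

In this toy corpus "computer" appears in every document, so its IDF and therefore its score P is zero, which matches the observation above that such features contribute little to classification.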

In this scheme, the knowledge base obtained in steps S10-S20 can be built by combining a text mining engine and knowledge graph technology with a dedicated security-domain knowledge base, providing one-stop natural language processing for information security. Once the knowledge base has been trained in this way, network security information received later can be processed for entity extraction, association relation extraction, data capture, trend tracking, hot spot identification, topic analysis, sentiment judgment and the like. Functions such as intelligent search, intelligent recommendation, public opinion analysis of security vulnerabilities, text classification mining of security information, intention analysis, relation analysis and a knowledge management system for security-domain knowledge can also be realized.

Texts related to network security information are collected continuously, around the clock, by customized tasks of an automatic crawler system, and the data sources for text collection are continuously expanded, so that network security information is analyzed across a broader range of fields and at greater depth, improving the validity and persuasiveness of the analysis results. Machine learning and deep learning are combined with security-domain knowledge to form a knowledge base specific to the security field, which is integrated with existing security services and provides data and technical support for them. Through intelligent recommendation, relation analysis, entity recognition, sentiment analysis and the like, the internal connections of network information security knowledge are mined, helping actual services to find more fraud or vulnerability scenarios.

The information security screening method provided by the invention achieves fine-grained screening of entry information through natural semantics, and the screening process can be realized by constructing a natural semantic processing platform. Taking text data as an example, the framework of the processing platform is shown in fig. 3. The platform comprises a functional layer, a capability layer, an algorithm layer and a basic layer. The functional layer can perform various kinds of information analysis, such as text classification, internet data search, management functions, and analysis of relations among data; the analysis and construction of internal association relations and the acquisition of data information in the method provided by the invention can be realized through the functional layer.

In this embodiment, after text data is obtained through the functional layer, the text data is sent to the capability layer, and word segmentation processing of the text data is performed by the capability layer, and meanwhile, semantic recognition is also realized.

After the capability layer has segmented the text data and performed semantic definition and recognition of the word features obtained from segmentation, the result is output to the algorithm layer, which trains on the word features. Specifically, the word features undergo vector learning through traditional language models in the algorithm layer, such as the TF-IDF model, word2vec and matrix decomposition, and are converted into vectors; the internal relations among the entries in the text data are then constructed from these vectors. In practical application, the internal relations can be specific to each word feature.
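The conversion of entries into vectors and the construction of internal relations from those vectors may be illustrated with a deliberately simple stand-in. Instead of TF-IDF, word2vec or matrix decomposition, the sketch below represents each entry by its raw occurrence counts per document and links two entries when the cosine similarity of their vectors reaches a threshold. The threshold value, the function names and the toy documents are assumptions made for illustration only.

```python
import math
from itertools import combinations


def term_vectors(docs):
    """Represent each entry by its occurrence count in every document."""
    vocab = sorted({t for doc in docs for t in doc})
    return {t: [doc.count(t) for doc in docs] for t in vocab}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def association_edges(docs, threshold=0.7):
    """Link every pair of entries whose vectors are at least `threshold` similar."""
    vectors = term_vectors(docs)
    edges = {}
    for a, b in combinations(vectors, 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


DOCS = [
    ["phishing", "link", "password"],
    ["phishing", "link"],
    ["firewall", "patch"],
]
EDGES = association_edges(DOCS)
```

The resulting edge set is a small graph of entries that tend to occur together; a learned embedding would replace the count vectors without changing the way the edges are derived.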

Finally, the data information to be identified is identified based on the internal association relation, so that the data information can be identified from multiple directions in the identification process, and whether the data information is safe or not is determined through semantics.

Based on data mining, knowledge graphs and linguistic knowledge, natural language processing language models are combined on top of a knowledge base constructed for the information security field. The traditional language models include a conditional random field model, a TF-IDF model, a hidden Markov model, a word2vec model and the like, together with the BERT model, which is relatively new in the industry. Machine learning models are fused in, including a vector space model, a probabilistic graphical model, a decision tree model, a support vector machine model and the like, and deep learning models from the natural language field, such as CNN, RNN, LSTM and xgboost, are added. Natural language processing is performed on network information texts to analyze the matching relations among keywords and the contextual and progressive relations, and the internal relations among network information security event entities are expressed as a knowledge graph through graph mining. In actual use, natural language processing methods play the main role and machine learning methods are combined with them to screen security events well, which improves the accuracy and recall of the system.

In order to solve the above-mentioned problems, the embodiment of the present invention further provides an information security screening apparatus, and referring to fig. 4, fig. 4 is a schematic diagram of functional modules of the information security screening apparatus according to the embodiment of the present invention. In this embodiment, the apparatus includes:

the data acquisition module 41 is configured to acquire data information related to network security on each internet channel through a crawler platform, where the crawler platform is built based on a distributed system architecture and a memory type computer engine, and the data information at least includes text data and image data;

a processing module 42 configured to:

and selecting a corresponding knowledge base from the information security identification base according to the data type, and carrying out security classification and screening processing on the security event based on the knowledge base.

In practical application, the functions implemented by the above-mentioned device may also be implemented by specific functional modules, where the device specifically includes:

the crawler platform is used for acquiring data information related to network security on each internet channel through the crawler platform, wherein the crawler platform is built based on a distributed system framework and a memory type computer engine, and the data information at least comprises text data and image data;

the algorithm training module is used for learning text semantics or image shape outlines of the data information according to a preset machine learning algorithm and a semantic definition algorithm of an entry, converting a learning result into a feature matrix of a word vector, and establishing internal association relations among different data information in the data information based on the feature matrix to obtain an information security recognition library, wherein the internal association relations comprise semantic association relations among text data and shape outline association relations among image data;

The screening module is used for acquiring a security event to be processed and determining the data type of the security event, wherein the security event is network information received by a network terminal from a network server through a network; and selecting a corresponding knowledge base from the information security recognition base according to the data type, and carrying out security classification and screening processing on the security event based on semantic association relations between text data and shape contour association relations between image data in the knowledge base.

The apparatus is based on the same embodiment as the information security screening method of the present invention, so the content of the information security screening apparatus is not described in detail in this embodiment.

The invention also provides a computer readable storage medium.

In this embodiment, the computer readable storage medium stores an information security screening program, and when the information security screening program is executed by a processor, the steps of the information security screening method described in any one of the above embodiments are implemented. The method implemented when the information security screening program is executed by the processor may refer to various embodiments of the information security screening method of the present invention, so that redundant description is omitted.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM), comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server or a network device, etc.) to perform the method according to the embodiments of the present invention.

While the embodiments of the present invention have been described above with reference to the drawings, the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art may make many modifications without departing from the spirit of the present invention and the scope of the appended claims. Any equivalent structure or equivalent flow change made using the description and drawings of the present invention, or any direct or indirect application in other relevant technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. The information security screening method is characterized by comprising the following steps of:

Selecting a corresponding knowledge base from the information security recognition base according to the data type, and carrying out security classification and screening treatment on the security event based on semantic association relations between text data and shape contour association relations between image data in the knowledge base;

the obtaining the data information related to the network security on each internet channel through the crawler platform comprises the following steps:

if the extracted basic data are image data, dividing the image data into a plurality of maps according to a dividing technology of an image shape minimum unit to form the data information, wherein the maps are image fragments with a determined complete outline with a single shape;

The text data is divided into a plurality of entries according to a semantic recognition technology, and the forming of the data information comprises:

the absolute frequency is calculated by the following steps: dividing the number of times of occurrence of the entry by the length of the text data to obtain the absolute frequency of the entry;

The machine learning algorithm comprises a language learning model and a regression training model, and the step of dividing the text data into a plurality of entries according to the semantic recognition technology to form the data information further comprises the steps of:

2. The method for screening information security according to claim 1, wherein the re-dividing and learning the text data in the text data according to the semantic definition rule and the language learning model to form a text knowledge base includes:

3. The information security screening method according to claim 2, wherein the TF-IDF model feature evaluation method includes:

4. The information security screening method of any one of claims 1-3, wherein the converting the machine learning result into a feature matrix of word vectors, and establishing an internal association between different data information in the data information based on the feature matrix comprises:

and constructing a three-dimensional spatial position relation graph of the word characteristics according to the position vector, wherein the three-dimensional spatial position relation graph comprises internal association relations among entries in the text data.

5. An information security screening apparatus, wherein the information security screening apparatus performs the steps of the information security screening method according to any one of claims 1 to 4, the information security screening apparatus comprising:

A processing module configured to:

6. An information security screening apparatus, characterized in that the information security screening apparatus comprises: a memory, a processor and an information security screening program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the information security screening method of any one of claims 1-4.

7. A computer-readable storage medium, wherein an information security screening program is stored on the computer-readable storage medium, which when executed by a processor, implements the steps of the information security screening method according to any one of claims 1 to 4.