CN111857097A - Industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency

Info

Publication number: CN111857097A
Application number: CN202010733364.6A
Authority: CN
Inventors: 李少森; 梁钰华; 孙豪; 黄剑湘; 杨光; 李�浩; 张启浩; 任君; 杨铖; 丁丙侯
Original assignee: Kunming Bureau of Extra High Voltage Power Transmission Co
Current assignee: Kunming Bureau of Extra High Voltage Power Transmission Co
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2020-10-30
Anticipated expiration: 2040-07-27
Also published as: CN111857097B

Abstract

The invention discloses a method for identifying abnormality diagnosis information of an industrial control system based on word frequency and inverse document frequency, comprising the following steps: establishing a response corpus for a diagnosis command; sending the diagnosis command to the tested system again to obtain the (N+1)th echo message; filtering stop words from all echo messages and segmenting them into words; calculating the inverse document frequency IDF of each word in each group of text lists of the echo messages by using the TF-IDF (word frequency and inverse document frequency) algorithm; setting a lowest inverse document frequency threshold IDFmin and deleting words whose IDF is not greater than IDFmin; building a phrase table V from the filtered text lists of the N+1 echo messages and calculating word frequency values; and setting a word frequency threshold and comparing the calculated word frequency values with it to judge whether an abnormality exists. The algorithm defines the health of each diagnosis command's echo information in a self-learning manner, which can greatly reduce the manual development cost of an industrial control monitoring system and improve the timeliness of event judgment.

Description

Industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency

Technical Field

The invention relates to the technical field of industrial control system abnormity diagnosis, in particular to an industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency.

Background

At present, some industrial control systems are operated and maintained through remote management. They provide no local operation interface, such as a screen or keys, for on-site operation and maintenance personnel; instead, a debugging computer must be connected, and the personnel interact with the device through debugging software or a browser to check and analyze system problems. When a channel or device abnormality occurs, on-site personnel learn of it only from channel-interruption alarms raised by other service systems or from feedback by personnel at a remote monitoring center (such as a dispatching master station at any level), and only then connect a debugging computer to the industrial control system to check, analyze and handle the cause. If remote monitoring does not notice the abnormality, it is discovered only during routine on-site maintenance or configuration backup, so fault handling is generally delayed. Because abnormalities in an industrial control system occur at random, periodic manual checking can rarely capture the detailed information present at the moment of the abnormality, and the quality of abnormality analysis declines as time passes.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency, and solves the problem of low quality of abnormity analysis of the industrial control system in the prior art.

The invention discloses an industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency, which comprises the following steps:

step 1: establishing a response corpus of diagnostic commands: sending a diagnosis command to a tested system for N times, and arranging the obtained N echo messages according to a time sequence to be used as a response corpus of the diagnosis command;

step 2: sending the diagnosis command to the tested system again to obtain the (N+1)th echo message, and appending it to the end of the response corpus established in step 1;

step 3: filtering stop words from the N+1 echo messages and segmenting them into words;

step 4: calculating the inverse document frequency IDF of each word in each group of text lists of the N+1 echo messages by using the TF-IDF (word frequency and inverse document frequency) algorithm;

step 5: setting a lowest inverse document frequency threshold IDFmin, and deleting from each group of text lists every word whose inverse document frequency IDF calculated in step 4 is less than or equal to IDFmin;

step 6: vectorizing the text lists of the N+1 echo messages filtered in step 5: extracting all phrases in the N+1 groups of text lists and removing duplicates to obtain a phrase table V of length M, where M is the total number of distinct phrases remaining after filtering and V contains every phrase that appears in the filtered N+1 groups of text lists; reordering the words of each filtered text list according to the order of the vocabulary in V; converting the phrases into vectors, in which each component is the number of times the corresponding word appears in its echo message; and calculating the word frequency value;

step 7: setting a word frequency threshold tf_max and comparing the word frequency value calculated in step 6 with tf_max; if the word frequency value is greater than or equal to tf_max, the message is identified as an abnormal message and alarm information is output.

According to an embodiment of the present invention, the sending time interval of the diagnostic command in step 1 is T, the value range of T is determined according to the time range in which the returned result of the diagnostic command may change, and the value range of T is 1 to 30 days under the condition that the system resource does not change suddenly; the value range of T is 1 s-24 h under the condition that the network channel is interrupted at any time.
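Steps 1 and 2, together with the sending interval T, can be sketched as follows. This is a minimal illustration only: `send_command` is a hypothetical callable standing in for whatever transport reaches the tested system (the text does not specify one), and the function names are assumptions introduced here.

```python
import time
from typing import Callable, List


def collect_corpus(send_command: Callable[[], str], n: int, interval_s: float) -> List[str]:
    """Send the diagnosis command n times, interval_s seconds apart.

    Echo messages are kept in the order they were received, so the
    returned list is already arranged in time sequence (step 1).
    """
    corpus: List[str] = []
    for i in range(n):
        corpus.append(send_command())
        if i < n - 1:
            time.sleep(interval_s)  # interval T between consecutive commands
    return corpus


def append_latest(corpus: List[str], send_command: Callable[[], str]) -> List[str]:
    """Send the command once more and append the (N+1)th echo at the end (step 2)."""
    return corpus + [send_command()]
```

In this sketch the interval is passed in seconds; a value of T in days would simply be converted before the call.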

According to an embodiment of the invention, the stop word in step 3 comprises a date and a time.

According to one embodiment of the invention, the date format is yyyy-mm-dd and the time formats are hh:mm:ss and h:mm.

According to an embodiment of the present invention, the word segmentation in step 3 specifically comprises: using whitespace as the separator, splitting the N+1 groups of command echoes into phrases to form N+1 groups of one-dimensional text lists.

According to an embodiment of the present invention, the calculation formula of the IDF in step 4 is:

According to one embodiment of the present invention, IDFmin ≥ 1 in step 5.

According to an embodiment of the present invention, in step 6, the word frequency value is calculated as follows: all phrases in the N+1 groups of text lists are extracted and duplicates removed to obtain a phrase table V of length M, where M is the total number of distinct phrases remaining after filtering and V contains every phrase that appears in the filtered N+1 groups of text lists; the words of each filtered text list are reordered according to the order of the vocabulary in V; the phrases are converted into vectors, in which each component is the number of times the corresponding word appears in its echo message, giving an (N+1) × M matrix A; letting a_ij be the element in row i and column j of A, the word frequency of each element a_(N+1)j in the (N+1)th group text list is defined as:

According to one embodiment of the present invention, tf_max in step 7 ranges from 0.2 to 0.5.

The beneficial effects that the invention can realize are as follows:

1. The method applies the word frequency and inverse document frequency algorithm to industrial control system diagnostic information, automatically mining the key information in each diagnosis command echo, such as changes to abnormal values or the sudden appearance of alarm content, without requiring the key content and abnormality criteria to be defined manually for each diagnosis command echo. The word frequency calculation then determines how often each piece of variable information appears in the samples, and an alarm is raised for variables that appear rarely (such as a sudden abnormally high CPU load or abnormal alarm information), prompting operation and maintenance personnel to respond in time.

2. The algorithm defines the health of each diagnosis command echo in a self-learning manner, which can greatly reduce the manual development cost of automated monitoring for industrial control systems. Because the analysis method is independent of the characteristics of the monitored system, it can be readily ported to operation-state monitoring for different business systems; it is highly adaptable, frees up manpower, improves the timeliness of event judgment, and improves operation and maintenance efficiency.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is an algorithm flow chart of an industrial control system abnormality diagnosis information identification method based on word frequency and inverse document frequency according to the present invention;

FIG. 2 is a schematic diagram of N echoed messages in an embodiment of an identification method for abnormality diagnosis information of an industrial control system based on word frequency and inverse document frequency according to the present invention;

FIG. 3 is a schematic diagram of N+1 echo messages in an embodiment of the method for identifying abnormality diagnosis information of an industrial control system based on word frequency and inverse document frequency according to the present invention.

Detailed Description

In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the various embodiments of the present invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary. In addition, some conventional structures and components are shown in simplified schematic form in the drawings.

In addition, the descriptions related to the first, the second, etc. in the present invention are only used for description purposes, do not particularly refer to an order or sequence, and do not limit the present invention, but only distinguish components or operations described in the same technical terms, and are not understood to indicate or imply relative importance or implicitly indicate the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

The invention discloses an industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency, an algorithm flow is shown in figure 1, and the method comprises the following steps:

step 1: establishing a response corpus of diagnostic commands: sending a diagnosis command to a tested system for N times according to a time interval T, and arranging the obtained N echo messages into a corpus as shown in FIG. 2 according to a time sequence to serve as a response corpus of the diagnosis command;

step 2: sending a diagnosis command to the tested system again to obtain the (N +1) th echo message, and adding the (N +1) th echo message to the end of the diagnosis command response corpus established in the step (1), wherein the arranged effect is shown in fig. 3;

step 3: filtering stop words from the N+1 echo messages and segmenting them into words, wherein the stop words comprise dates in the format yyyy-mm-dd and times in the formats hh:mm:ss and h:mm, and the word segmentation comprises: using whitespace as the separator, splitting the N+1 groups of command echoes into phrases to form N+1 groups of one-dimensional text lists;

step 4: calculating the inverse document frequency IDF of each word in each group of text lists of the N+1 echo messages by using the TF-IDF (word frequency and inverse document frequency) algorithm, wherein the calculation formula of the IDF is as follows:

step 5: setting a lowest inverse document frequency threshold IDFmin, and deleting from each group of text lists every word whose inverse document frequency IDF calculated in step 4 is less than or equal to IDFmin;

step 6: vectorizing the text lists of the N+1 echo messages filtered in step 5: extracting all phrases in the N+1 groups of text lists and removing duplicates to obtain a phrase table V of length M, where M is the total number of distinct phrases remaining after filtering and V contains every phrase that appears in the filtered N+1 groups of text lists; reordering the words of each filtered text list according to the order of the vocabulary in V; converting the phrases into vectors, in which each component is the number of times the corresponding word appears in its echo message, to obtain an (N+1) × M matrix A; letting a_ij be the element in row i and column j of A, the word frequency of each element a_(N+1)j in the (N+1)th group text list is defined as:

step 7: setting a word frequency threshold tf_max and comparing the word frequency value calculated in step 6 with tf_max; if the word frequency value is greater than or equal to tf_max, the message is identified as an abnormal message and alarm information is output.

Example one

Taking the longitudinal encryption device of the ±800 kV Pu'er converter station as an example:

Suppose that, after a top diagnosis command is sent to the longitudinal encryption device at a certain time, the echo message of the tested system is as follows:

top - 18:29:33 up 2:26, 1 user, load average: 0.00, 0.03, 0.06

Tasks: 0 total, 0 running, 0 sleeping, 0 stopped, 0 zombie

%Cpu(s): 20.0 us, 0.0 sy, 0.0 ni, 80.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

MiB Mem: 987.4 total, 91.3 free, 642.2 used, 253.8 buff/cache

MiB Swap: 1022.0 total, 776.4 free, 245.6 used. 185.3 avail Mem

Step 1: the diagnosis command is sent to the tested system N times at a time interval of T = 5 seconds, and the N echo messages obtained are arranged in time order to form the corpus of the command. The echo messages contain both meaningless and meaningful information: meaningless information includes dates, times and comments, while meaningful information reflects the state of the tested system, such as CPU occupancy, memory occupancy and alarm prompts.

Step 2: after the corpus of the diagnosis command is obtained, the diagnosis command is sent to the tested system again to obtain the (N+1)th echo message, which is added to the corpus.

Step 3: stop words are filtered from the N+1 texts: the stop words comprise dates in the format yyyy-mm-dd and times in the formats hh:mm:ss and h:mm, so that, for example, 18:29:33 and 2:26 are filtered out. The N+1 texts are then segmented: using whitespace as the separator, the N+1 groups of command echoes are split into phrases to form N+1 groups of one-dimensional text lists:

[top,up,1,user,load,average,0.00,0.03,0.06,Tasks……]。
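The stop-word filtering and whitespace segmentation of step 3 can be sketched as follows. The regular expressions, the function name `tokenize`, and the stripping of trailing punctuation (commas, colons, hyphens) are assumptions made here so that the first line of the top echo yields the list shown above; the text itself only specifies whitespace as the separator and date/time strings as stop words.

```python
import re
from typing import List

# Assumed stop-word patterns: dates (yyyy-mm-dd) and times (hh:mm:ss or h:mm).
STOP_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),
    re.compile(r"\d{1,2}:\d{2}(:\d{2})?"),
]


def tokenize(echo: str) -> List[str]:
    """Split an echo message on whitespace and drop date/time stop words."""
    tokens: List[str] = []
    for raw in echo.split():
        # Assumption: separators attached to a word are not part of the word.
        word = raw.strip(",:;-")
        if not word:
            continue  # the token was punctuation only, e.g. "-"
        if any(p.fullmatch(word) for p in STOP_PATTERNS):
            continue  # date or time stop word
        tokens.append(word)
    return tokens


line = "top - 18:29:33 up 2:26, 1 user, load average: 0.00, 0.03, 0.06"
print(tokenize(line))
# ['top', 'up', '1', 'user', 'load', 'average', '0.00', '0.03', '0.06']
```

Applying `tokenize` to each of the N+1 echo messages gives the N+1 one-dimensional text lists used in the following steps.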

Step 4: the TF-IDF algorithm is applied to calculate the inverse document frequency IDF of each word in each group of text lists of the N+1 echo messages:

where the word top is present in N +1 parts of text, it is

Step 5: the lowest inverse document frequency threshold IDFmin is set to 1.0, and every word in each group of text lists whose inverse document frequency is less than or equal to this threshold is deleted. This filters out meaningless information in the command echo: label words such as "Tasks", "top", "user", "load" and "average" carry no state information and appear in all N+1 texts, so their inverse document frequency is less than 1.0 and they are filtered out.

Step 6: vectorizing the filtered N+1 groups of text lists: all phrases in the N+1 groups of text lists are extracted and duplicates removed to obtain a phrase table V of length M: ["0.00", "0.03", "0.06", ……], where M is the total number of distinct phrases remaining after filtering and V contains every phrase appearing in the filtered N+1 groups of text lists. The words of each filtered text list are then reordered according to the order of the vocabulary in V, and the phrases are converted into vectors: if a text list contains "0.00" once, "0.03" zero times and "0.06" three times, it is vectorized as [1, 0, 3, ……], where the position of each count in the vector matches the position of the corresponding phrase in the phrase table V.

After this processing, an (N+1) × M matrix A is obtained. Letting a_ij be the element in row i and column j of A, the word frequency of each element a_(N+1)j in the (N+1)th group text list is defined as:

The matrix obtained after vectorizing the N+1 groups is shown in Table 1 below:

Matrix A	Column 1	Column 2	……	Column M
Row 1:	1	0	0	1
Row 2:	0	0	0	2
……	2	0	0	1
Row N+1:	1	0	1	1

Therefore:

Sum of each column: 4, 0, 1, 5

TF of the elements of row N+1: 0.2, 0, 0.5, 0.166667
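The Table 1 figures (column sums 4, 0, 1, 5 and TF values 0.2, 0, 0.5, 0.166667 for a last row of 1, 0, 1, 1) can be reproduced with the following form. It is inferred here from those numbers and should be read as an assumption rather than as the formula stated in the original filing:

```latex
tf_{(N+1)j} = \frac{a_{(N+1)j}}{1 + \sum_{i=1}^{N+1} a_{ij}}
\qquad\Longrightarrow\qquad
\frac{1}{1+4} = 0.2,\quad
\frac{0}{1+0} = 0,\quad
\frac{1}{1+1} = 0.5,\quad
\frac{1}{1+5} \approx 0.166667
```

Under this form, a value that has never appeared before the (N+1)th message has a column sum of 1 and therefore a TF of 0.5, the largest value a single occurrence can produce, while values seen often in the corpus approach 0.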

Step 7: the word frequency threshold is set to tf_max = 0.5; when the word frequency of any vector element in the (N+1)th group text list is greater than or equal to tf_max, an abnormal message is considered to have appeared, and the algorithm outputs alarm information to remind operation and maintenance personnel to pay attention.

Example two

Taking the longitudinal encryption device of the ±800 kV Kunzei converter station as an example:

Step 1: a top diagnosis command is sent to the longitudinal encryption authentication device at a period of T = 10 seconds to obtain 4 echo messages, as shown in Table 2:

Step 2: the diagnosis command is sent to the longitudinal encryption authentication device again to obtain the 5th echo message, as shown in Table 3:

Step 3: stop words are filtered from the 5 echo messages in the corpus in a uniform format; the content of the processed corpus is shown in Table 4, with time-related useless information deleted:

All echo messages in the corpus are then segmented in a uniform format: using whitespace as the separator, the N+1 groups of command echoes are converted into N+1 groups of one-dimensional text lists; the processed corpus content is shown in Table 5:

Step 4: the TF-IDF algorithm is applied to calculate the inverse document frequency IDF of each word in each group of text lists of the N+1 echo messages. The IDF is calculated on the corpus after stop-word filtering and word segmentation; taking the 1st, 2nd and 7th words of the 1st echo message as an example, the results are shown in Table 6:

Step 5: the lowest inverse document frequency threshold IDFmin is set to 0.1; if a word's IDF value is lower than 0.1, it is considered to be information that appears too frequently. As can be seen from Table 6, top and up are unimportant echo information and are filtered out, while 0.00 is important echo information and is retained. After the IDF calculation for all echo information is completed, the updated corpus is as shown in Table 7:

1	0.00	0.03	0.06
2	0.01	0.05	0.07
3	0.00	0.07	0.06
4	0.01	0.03	0.06
5	0.02	0.04	0.18
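The IDF calculation and threshold filtering of steps 4 and 5 can be sketched as follows. The IDF formula is not reproduced in this text, so the sketch assumes a common smoothed form, ln(total / (1 + df)), where df is the number of echo messages containing the word; the original definition may differ. The input lists are illustrative: they combine the values of Table 7 with the words top and up, which the text says are filtered out.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency of every word across all text lists."""
    total = len(docs)
    # Document frequency: count each word at most once per text list.
    df = Counter(word for doc in docs for word in set(doc))
    # Assumed smoothed form; not taken from the source document.
    return {word: math.log(total / (1 + n)) for word, n in df.items()}


def filter_by_idf(docs: List[List[str]], idf_min: float) -> List[List[str]]:
    """Delete every word whose IDF is less than or equal to idf_min."""
    idf = idf_table(docs)
    return [[w for w in doc if idf[w] > idf_min] for doc in docs]


docs = [
    ["top", "up", "0.00", "0.03", "0.06"],
    ["top", "up", "0.01", "0.05", "0.07"],
    ["top", "up", "0.00", "0.07", "0.06"],
    ["top", "up", "0.01", "0.03", "0.06"],
    ["top", "up", "0.02", "0.04", "0.18"],
]
filtered = filter_by_idf(docs, idf_min=0.1)
```

With these inputs, top and up appear in all five lists and receive a negative IDF under the assumed formula, so they are removed, while every numeric value stays above 0.1 and is kept, which matches the rows of Table 7.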

Step 6: the corpus is deduplicated to generate a list of important echo information; the result, shown in Table 8, is the set of all distinct important information in the corpus:

The text lists of the N+1 echo messages filtered in step 5 are then vectorized: all phrases in the N+1 groups of text lists are extracted and duplicates removed to obtain a phrase table V of length M, where M is the total number of distinct phrases remaining after filtering and V contains every phrase that appears in the filtered N+1 groups of text lists; the words of each filtered text list are reordered according to the order of the vocabulary in V; and the phrases are converted into vectors, in which each component is the number of times the corresponding word appears in its echo message. The conversion result is shown in Table 9:

1	{1,1,1,0,0,0,0,0,0}
2	{0,0,0,1,1,1,0,0,0}
3	{1,0,1,0,0,1,0,0,0}
4	{0,1,1,1,0,0,0,0,0}
5	{0,0,0,0,0,0,1,1,1}
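The vectorization of step 6 can be sketched as follows, using the five filtered text lists of Table 7. Ordering the phrase table V by first appearance is an assumption, chosen because it reproduces the vectors of Table 9. The word frequency function likewise assumes the form a / (1 + column sum), which is consistent with the figures of Table 1 but is not quoted from the source.

```python
from typing import List


def build_vocabulary(docs: List[List[str]]) -> List[str]:
    """Phrase table V: distinct phrases in order of first appearance (assumed order)."""
    vocab: List[str] = []
    seen = set()
    for doc in docs:
        for word in doc:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def count_matrix(docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Matrix A: entry (i, j) counts how often vocab[j] occurs in docs[i]."""
    index = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc:
            matrix[i][index[word]] += 1
    return matrix


def last_row_tf(matrix: List[List[int]]) -> List[float]:
    """Word frequency of each element in the last (N+1)th row.

    Assumed form: a / (1 + column sum over all N+1 rows).
    """
    col_sums = [sum(col) for col in zip(*matrix)]
    return [a / (1 + s) for a, s in zip(matrix[-1], col_sums)]


docs = [
    ["0.00", "0.03", "0.06"],
    ["0.01", "0.05", "0.07"],
    ["0.00", "0.07", "0.06"],
    ["0.01", "0.03", "0.06"],
    ["0.02", "0.04", "0.18"],
]
V = build_vocabulary(docs)
A = count_matrix(docs, V)
tf = last_row_tf(A)
```

Here `A` has 5 rows and M = 9 columns, and under the assumed formula only the three values that first appear in the 5th message receive a non-zero word frequency.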

TF word frequency calculation is then performed on the 5 echo messages using the word frequency formula; the calculation results are shown in Table 10:

Step 7: a word frequency threshold tf_max is set. If the TF value of any echo information in a message is greater than or equal to this set value, the message is judged to contain abnormal echo information; if the TF value is less than the set value, the message is judged to be a normal message. The final result is shown in Table 10:

Therefore, elements No. 7, No. 8 and No. 9 of message No. 5 are abnormal; the message is an abnormal message and an alarm is issued.
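The threshold comparison of step 7 can be sketched as follows. The word frequency values and the threshold of 0.5 are illustrative assumptions: Table 10 and the threshold used in this example are not reproduced in the text, so the values below are chosen only to be consistent with the stated result and with the 0.2 to 0.5 range given for tf_max. The function name is likewise hypothetical.

```python
from typing import List


def abnormal_elements(tf_row: List[float], tf_max: float) -> List[int]:
    """Return the 1-based positions whose word frequency is >= tf_max."""
    return [j + 1 for j, tf in enumerate(tf_row) if tf >= tf_max]


# Illustrative word frequencies for the 5th message (assumed values).
tf_row = [0, 0, 0, 0, 0, 0, 0.5, 0.5, 0.5]
flagged = abnormal_elements(tf_row, tf_max=0.5)
is_abnormal = bool(flagged)  # any flagged element makes the message abnormal

if is_abnormal:
    print(f"ALARM: abnormal elements {flagged}")
# ALARM: abnormal elements [7, 8, 9]
```

A message with no element at or above the threshold yields an empty list and is treated as normal.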

The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

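1. An industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency is characterized by comprising the following steps:

step 1: establishing a response corpus of diagnostic commands: sending a diagnosis command to a tested system N times, and arranging the obtained N echo messages in time order to serve as the response corpus of the diagnosis command;

step 2: sending the diagnosis command to the tested system again to obtain the (N+1)th echo message, and appending it to the end of the response corpus established in step 1;

step 3: filtering stop words from the N+1 echo messages and segmenting them into words;

step 4: calculating the inverse document frequency IDF of each word in each group of text lists of the N+1 echo messages by using the TF-IDF (word frequency and inverse document frequency) algorithm;

step 5: setting a lowest inverse document frequency threshold IDFmin, and deleting from each group of text lists every word whose inverse document frequency IDF calculated in step 4 is less than or equal to IDFmin;

step 6: vectorizing the text lists of the N+1 echo messages filtered in step 5: extracting all phrases in the N+1 groups of text lists and removing duplicates to obtain a phrase table V of length M, where M is the total number of distinct phrases remaining after filtering and V contains every phrase that appears in the filtered N+1 groups of text lists; reordering the words of each filtered text list according to the order of the vocabulary in V; converting the phrases into vectors, in which each component is the number of times the corresponding word appears in its echo message; and calculating the word frequency value;

step 7: setting a word frequency threshold tf_max and comparing the word frequency value calculated in step 6 with tf_max; if the word frequency value is greater than or equal to tf_max, the message is identified as an abnormal message and alarm information is output.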

2. The method for identifying the industrial control system abnormity diagnostic information based on the word frequency and the inverse document frequency according to claim 1, wherein the sending time interval of the diagnostic command in the step 1 is T, the value range of T is determined according to the time range of possible change of the return result of the diagnostic command, and the value range of T is 1-30 days under the condition that the system resource is not mutated; the value range of T is 1 s-24 h under the condition that the network channel is interrupted at any time.

3. The method for identifying the industrial control system abnormality diagnosis information based on the word frequency and the inverse document frequency as claimed in claim 1, wherein the stop word in the step 3 includes a date and a time.

4. The method for identifying the abnormality diagnosis information of the industrial control system based on the word frequency and the inverse document frequency as claimed in claim 3, wherein the date format is yyyy-mm-dd, and the time formats are hh:mm:ss and h:mm.

5. The method for identifying the industrial control system abnormality diagnosis information based on the word frequency and the inverse document frequency as claimed in claim 1, wherein the word segmentation in the step 3 specifically comprises: using whitespace as the separator, splitting the N+1 groups of command echoes into phrases to form N+1 groups of one-dimensional text lists.

6. The method for identifying the industrial control system abnormality diagnosis information based on the word frequency and the inverse document frequency as claimed in claim 1, wherein the calculation formula of the IDF in the step 4 is:

7. The method for identifying the industrial control system abnormality diagnosis information based on the word frequency and the inverse document frequency as claimed in claim 1, wherein IDFmin is greater than or equal to 1 in the step 5.

8. The method for identifying the abnormality diagnosis information of the industrial control system based on the word frequency and the inverse document frequency as claimed in claim 1, wherein in the step 6, the word frequency value is calculated as follows: all phrases in the N+1 groups of text lists are extracted and duplicates removed to obtain a phrase table V of length M, where M is the total number of distinct phrases remaining after filtering and V contains every phrase that appears in the filtered N+1 groups of text lists; the words of each filtered text list are reordered according to the order of the vocabulary in V; the phrases are converted into vectors, in which each component is the number of times the corresponding word appears in its echo message, giving an (N+1) × M matrix A; letting a_ij be the element in row i and column j of A, the word frequency of each element a_(N+1)j in the (N+1)th group text list is defined as:

9. The method for identifying the abnormality diagnosis information of the industrial control system based on the word frequency and the inverse document frequency as claimed in claim 1, wherein tf_max in the step 7 ranges from 0.2 to 0.5.