CN113901815B - Emergency working condition event detection method based on dam operation log - Google Patents
Emergency working condition event detection method based on dam operation log
- Publication number
- CN113901815B (application CN202111202004.4A)
- Authority
- CN
- China
- Prior art keywords
- vector
- word
- sentence
- document
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/20—Administration of product repair or maintenance
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/08—Construction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Strategic Management (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Marketing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- Game Theory and Decision Science (AREA)
- Primary Health Care (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
The present invention relates to a method for detecting emergency working-condition events based on dam operation logs. It performs event detection on dam operation logs in the field of hydraulic engineering, specifically on the various special working-condition events experienced by a dam over long-term operation and on the corresponding response events, and belongs to the technical field of natural language processing.
Background Art
The task of event detection is to identify event trigger words in large-scale unstructured natural-language text and to correctly classify the event types. Trigger words are the core words or phrases that most clearly express the occurrence of an event. Event detection is of great significance for event semantic modeling and facilitates the subsequent structured management and storage of events.
In the field of water conservancy engineering, dams provide flood control, ice-jam prevention, water storage, water supply, power generation, and other functions, and are a mainstay of China's water conservancy sector. Over decades of long-term operation, a dam encounters many natural risk events, such as floods, earthquakes, and rainstorms, which may endanger the structural safety of the dam and the lives and property of people downstream. After such a special event occurs, dam managers therefore arrange a comprehensive special inspection to maintain the dam structure. In addition, routine inspection and maintenance are important measures for ensuring the safety of the dam body. After the various response measures, inspection personnel record the cause and the result of the inspection in writing, forming a dam operation log file.
By processing the dam operation logs, the safety history of the dam since its construction can be analyzed and a dam event knowledge base can be built, improving the level of intelligent dam management. An emergency working-condition event detection method for dam operation logs can bypass event triggers, detect all predefined events in the logs, and classify the event type of each document, providing a basis for subsequent event extraction, event graph construction, and event knowledge base construction.
Chinese text contains a great deal of ambiguity, and an event generally consists of an event trigger and event arguments. Event triggers are mostly verbs and commonly suffer from polysemy and from mismatch between triggers and words, so event detection methods centered on trigger identification are prone to classification errors.
Summary of the Invention
Purpose of the invention: in view of the problems in the prior art, and of the lack of standardized records of the various natural events and response-measure events encountered during dam operation, the present invention provides an emergency working-condition event detection method based on dam operation logs. The method avoids the trigger-identification step by simulating triggers within sentences, detects special dam working-condition events from the dam operation logs, and classifies the event type of each document, providing a basis for subsequent event extraction.
Technical solution: a method for detecting emergency working-condition events based on dam operation logs, comprising the following steps:
(1) Log file preprocessing: the dam operation logs are first sorted and split according to their recording dates, each document is numbered, the sentences in each document are sorted, numbered, and segmented into words, each word is labeled with an entity type and a part of speech, and a set of dam emergency working-condition event types is then constructed; sorting refers to ordering logs of different dates, and splitting refers to dividing logs of the same day according to document content;
(2) Encoding vector embedding: the pre-trained ALBERT model is used to encode all segmented words in the dam operation logs and convert them into the corresponding word embedding vectors;
(3) BiLSTM feature fusion: a BiLSTM fuses each segmented word's embedding vector, named-entity-type vector, and part-of-speech vector, strengthening the word's semantic information;
(4) Dual-attention semantic enhancement: sentence-document dual attention fuses contextual information; sentence-level attention up-weights the words in each sentence that may trigger an event, and document-level attention up-weights the sentences in each log document that may trigger an event, addressing the polysemy and word-trigger mismatch problems of traditional Chinese event detection;
(5) Training the model with the focal loss function and performing classification: because each sentence in an ordinary dam log document contains at most two events, the positive and negative samples of the binary classification are imbalanced; the focal loss function is therefore used to train the model and classify the events to which all documents belong.
The set of dam emergency working-condition event types includes typical events such as earthquake, rainstorm, flood discharge, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection.
The named-entity types include person name, department, location, time, date, measured value, percentage, defect type, and the like; the part-of-speech tags include noun, verb, adjective, quantifier, pronoun, and the like.
Furthermore, step (1) includes the following steps:
(1.1) The dam operation log file is first divided into multiple documents according to the log recording date; each document is sorted and numbered, the sentences in each document are sorted and numbered, and jieba is used to segment each sentence into words;
(1.2) Entity-type labeling and part-of-speech tagging are performed on the segmentation results; the entity-type labels are converted into low-dimensional vectors by looking up a randomly initialized embedding table, Stanford CoreNLP is used to tag the part of speech of each word, and the part-of-speech tags are then converted into low-dimensional vectors by looking up the corresponding embedding table;
(1.3) The dam emergency working-condition event types are predefined, including typical events such as earthquake, rainstorm, flood discharge, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection.
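For illustration, the label inventories above can be held as simple lookup tables. The following is a minimal Python sketch; the English identifiers are illustrative stand-ins for the Chinese labels used in the logs, and the concrete sets on a real dam-log corpus may differ.

```python
# Illustrative label inventories derived from the lists above.
EVENT_TYPES = ["earthquake", "rainstorm", "flood_discharge", "pre_flood_safety_inspection",
               "comprehensive_special_inspection", "routine_maintenance", "daily_inspection"]
ENTITY_TYPES = ["person", "department", "location", "time", "date",
                "measured_value", "percentage", "defect_type", "other"]
POS_TAGS = ["noun", "verb", "adjective", "quantifier", "pronoun", "other"]

EVENT2ID = {t: i for i, t in enumerate(EVENT_TYPES)}
ENTITY2ID = {t: i for i, t in enumerate(ENTITY_TYPES)}
POS2ID = {t: i for i, t in enumerate(POS_TAGS)}
```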
Furthermore, step (2) includes the following steps:
The pre-trained ALBERT model is used to encode all the segmented words from (1.1) and convert them into vector representations that a computer can process.
Furthermore, step (3) includes the following steps:
(3.1) The embedding vector, entity-type vector, and part-of-speech vector of each word are concatenated, where the embedding vector is the vector obtained in step (2), the entity-type vector is the vector corresponding to the word's named-entity recognition result (such as person name, organization, location, time, date, numeric value, or percentage), and the part-of-speech vector is the vector corresponding to the word's part-of-speech tag (such as noun, verb, adjective, quantifier, or pronoun);
(3.2) A BiLSTM model processes the concatenated vectors of a single sentence, taking each vector as one input; the bidirectional LSTM units capture the context of each word and output a forward hidden state and a backward hidden state, which are concatenated into the output vector h_k.
Furthermore, step (4) includes the following steps:
(4.1) In the training set, the predefined emergency working-condition event contained in each sentence is converted into an embedding vector t_1 by looking up a randomly initialized embedding table, and each document is converted into an embedding vector d using Doc2Vec;
(4.2) For all the segmented words in each sentence, a local attention mechanism computes the weight of each word within the sentence, raising the attention weight of words that trigger the target event type and thereby simulating a trigger; the calculation formula is as follows:
α_s^k = exp(h_k · t_1^T) / Σ_j exp(h_j · t_1^T)
where h_k is the k-th part of the output vector h, α_s^k is the k-th component of the local attention vector α_s, and t_1^T is the transpose of the event-type embedding vector; the trigger refers to an event trigger, i.e., the word (generally a verb) that triggers an event;
(4.3) For all the segmented words in each sentence, a global attention mechanism computes the weight, within its document, of the sentence containing the word, so as to obtain the unique meaning of the trigger in that scenario, assist in judging the event type of the sentence, and resolve the ambiguity of triggers caused by contextual information; the calculation formula is as follows:
α_d^k = exp(h_k · t_2^T + h_k · d^T) / Σ_j exp(h_j · t_2^T + h_j · d^T)
where h_k is the k-th part of the output vector h, α_d^k is the k-th component of the global attention vector α_d, t_2^T is the transpose of the event-type embedding vector, and d^T is the transpose of the document-level embedding vector;
(4.4) Local attention and global attention are fused with weights to improve event detection accuracy; the weight vectors of local attention and global attention with respect to the event, and the weighted fusion, are computed as follows:
v_s = α_s · t_1
v_d = α_d · t_2
o = σ(λ · v_s + (1 − λ) · v_d)
Here the final output value o consists of two parts, v_s and v_d. v_s is generated by the dot product of α_s and the event-type embedding vector t_1 and captures local features while simulating hidden event triggers; v_d is generated by the dot product of α_d and t_2 and captures global features and contextual information. σ is the sigmoid function, and λ ∈ [0,1] is a hyperparameter that trades off between v_s and v_d.
Furthermore, step (5) includes the following steps:
The data set is processed in units of sentences, with <sentence, event type> pairs forming the training data; each pair indicates whether the given sentence conveys an event of type t, and the event-type label is 1 or 0. For example, the training pair <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, daily inspection> is labeled 1, while <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, earthquake> is labeled 0. Since the number of events a single sentence can express is small compared with the number of predefined events, the negative samples of the binary classification far outnumber the positive samples; to address this imbalance, the model is trained with the focal loss function, strengthening the influence of positive samples and hard samples on the model. The calculation formula is as follows:
J(θ) = −Σ_i [ β · y^(i) · (1 − o(x^(i)))^γ · log(o(x^(i))) + (1 − β) · (1 − y^(i)) · (o(x^(i)))^γ · log(1 − o(x^(i))) ] + δ · ||θ||²
where x consists of a sentence and a target event type, y ∈ {0,1}, o(x^(i)) is the model prediction, ||θ||² is the sum of squares of the model parameters, δ > 0 is the weight of the L2 regularization term, β is a parameter balancing the weights of positive and negative samples, and γ is a parameter balancing the weights of hard and easy samples; in this experiment, β = 0.25 and γ = 2.
Finally, the trained model performs event detection on the dam operation log files and classifies each document based on the event types it contains.
An emergency working-condition event detection system based on dam operation logs, which performs event detection on dam operation logs in the water conservancy field, comprising:
a log file preprocessing module, which first sorts and splits the dam operation logs according to their recording dates, numbers each document, sorts, numbers, and segments the sentences in each document, labels each word with an entity type and a part of speech, and then constructs a set of dam emergency working-condition event types;
an encoding vector embedding module, which uses the pre-trained ALBERT model to encode all segmented words in the dam operation logs and convert them into the corresponding word embedding vectors;
a BiLSTM feature fusion module, which uses a BiLSTM to fuse each segmented word's embedding vector, named-entity-type vector, and part-of-speech vector, strengthening the word's semantic information;
a dual-attention semantic enhancement module, which uses sentence-document dual attention to fuse contextual information;
and a model, which is trained with the focal loss function and then used to classify the events to which all documents belong.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the emergency working-condition event detection method based on dam operation logs described above.
A computer-readable storage medium storing a computer program for executing the emergency working-condition event detection method based on dam operation logs described above.
Beneficial effects: compared with the prior art, the emergency working-condition event detection method based on dam operation logs provided by the present invention captures keywords and sentence-level semantic information through local attention and simulates hidden event triggers, realizing event detection without triggers; it introduces rich document-level contextual information through global attention to help judge the meaning of a word in its actual context, skipping the trigger-identification step and directly judging the event type. This avoids the problems of mismatch between Chinese words and triggers and of polysemy, and improves event detection accuracy.
Brief Description of the Drawings
FIG. 1 is a flow chart of model training according to an embodiment of the present invention;
FIG. 2 is a diagram of the model training framework according to an embodiment of the present invention.
Detailed Description
The present invention is further explained below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the present invention and not to limit its scope; after reading the present disclosure, modifications of various equivalent forms made by those skilled in the art shall fall within the scope defined by the claims appended to this application.
As shown in FIG. 1, the emergency working-condition event detection method based on dam operation logs mainly includes the following steps:
Step (1): preprocess the dam operation log files.
(1.1) The dam operation log file is first divided into multiple documents according to the log recording date to serve as the training set; each document is sorted and numbered, the sentences in each document are sorted and numbered, and jieba is used to segment each sentence into words. As shown in FIG. 2, the sentence "近坝库岸、枢纽区边坡及公路检查情况：无异常" ("Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality") is first split into "近坝", "库岸", "、", "枢纽区", "边坡", "及", "公路", "检查", "情况", "：", "无", "异常".
(1.2) Entity-type labeling and part-of-speech tagging are performed on the segmentation results; the entity-type labels are converted into low-dimensional vectors by looking up a randomly initialized embedding table, Stanford CoreNLP is used to tag the part of speech of each word, and the tags are then converted into low-dimensional vectors by looking up an embedding table.
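The following is a minimal sketch of steps (1.1)–(1.2), assuming jieba for word segmentation. jieba's bundled part-of-speech tagger is used here only as a stand-in for the Stanford CoreNLP tagger named above, the sentence split on "。" is a simplification, and `tag_entities` is a hypothetical placeholder for whichever Chinese NER tool is actually used.

```python
import jieba.posseg as pseg  # stand-in for Stanford CoreNLP POS tagging

def tag_entities(words):
    # Hypothetical placeholder: a real implementation would call a Chinese NER
    # model covering names, departments, locations, dates, measured values, etc.
    return ["other"] * len(words)

def preprocess_document(doc_id, doc_text):
    """Split one dam-log document into numbered, segmented, and tagged sentences."""
    sentences = [s for s in doc_text.replace("；", "。").split("。") if s.strip()]
    processed = []
    for sent_id, sent in enumerate(sentences):
        pairs = pseg.lcut(sent)                # [(word, pos_flag), ...]
        words = [w for w, _ in pairs]
        processed.append({
            "doc_id": doc_id,
            "sent_id": sent_id,
            "words": words,
            "pos": [flag for _, flag in pairs],
            "ner": tag_entities(words),
        })
    return processed
```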
(1.3) The dam emergency working-condition event types are predefined for the dam operation logs, including typical events such as earthquake, rainstorm, flood discharge, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection.
Step (2): encode the segmented words into word vectors.
The pre-trained ALBERT model is used to encode all the segmented words from (1.1) and convert them into vector representations that a computer can process.
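A sketch of this step using the HuggingFace `transformers` API is given below. The checkpoint path is a placeholder for any pre-trained Chinese ALBERT model (a fast tokenizer is assumed), and sub-token vectors are mean-pooled back onto the jieba word segments.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; substitute any pre-trained Chinese ALBERT model here.
tokenizer = AutoTokenizer.from_pretrained("path/to/chinese-albert")
albert = AutoModel.from_pretrained("path/to/chinese-albert")

def encode_words(words):
    """Return one embedding per segmented word by mean-pooling its sub-tokens."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = albert(**enc).last_hidden_state[0]   # (num_subtokens, dim)
    word_ids = enc.word_ids(0)                        # maps sub-tokens back to words
    vectors = []
    for idx in range(len(words)):
        pieces = [i for i, w in enumerate(word_ids) if w == idx]
        vectors.append(hidden[pieces].mean(dim=0))
    return torch.stack(vectors)                       # (num_words, dim)
```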
Step (3): concatenate the word vectors, named-entity-type vectors, and part-of-speech vectors, then extract semantic information.
(3.1) The embedding vector, entity-type vector, and part-of-speech vector of each segmented word are concatenated, where the embedding vector is the vector obtained in step (2), the entity-type vector is the vector corresponding to the word's named-entity recognition result (such as person name, organization, location, time, date, measured value, percentage, or defect type), and the part-of-speech vector is the vector corresponding to the word's part-of-speech tag (such as noun, verb, adjective, quantifier, or pronoun).
(3.2) A BiLSTM model processes the concatenated vectors of a single sentence, taking each vector as one input; the bidirectional LSTM units capture the context of each word and output a forward hidden state and a backward hidden state, which are concatenated into the output vector h_k.
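A PyTorch sketch of this fusion step is given below; the embedding dimensions and vocabulary sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusionBiLSTM(nn.Module):
    """Concatenate word, entity-type, and POS vectors, then run a BiLSTM over the sentence."""
    def __init__(self, word_dim=768, ent_dim=25, pos_dim=25,
                 num_entity_types=9, num_pos_tags=6, hidden=150):
        super().__init__()
        self.ent_emb = nn.Embedding(num_entity_types, ent_dim)  # randomly initialised tables
        self.pos_emb = nn.Embedding(num_pos_tags, pos_dim)
        self.bilstm = nn.LSTM(word_dim + ent_dim + pos_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vecs, ent_ids, pos_ids):
        # word_vecs: (batch, seq_len, word_dim); ent_ids, pos_ids: (batch, seq_len)
        x = torch.cat([word_vecs, self.ent_emb(ent_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.bilstm(x)  # (batch, seq_len, 2*hidden): forward/backward states concatenated
        return h
```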
Step (4): use the dual attention mechanism to capture sentence-level and document-level context, strengthen the word vector representations, and simulate triggers.
(4.1) In the training set, the event contained in each sentence is converted into an embedding vector t_1 by looking up a randomly initialized embedding table, and each document is converted into an embedding vector d using Doc2Vec.
(4.2) For all the segmented words in each sentence, a local attention mechanism computes the weight of each word within the sentence, raising the attention weight of words that trigger the target event type and thereby simulating a trigger; the calculation formula is as follows:
α_s^k = exp(h_k · t_1^T) / Σ_j exp(h_j · t_1^T)
where h_k is the k-th part of the output vector h, α_s^k is the k-th component of the local attention vector α_s, and t_1^T is the transpose of the event-type embedding vector. As shown in FIG. 2, t_1 assists the local attention mechanism in simulating a trigger for each segmented word.
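A sketch of this local attention, written directly from the softmax form above, is given below; it assumes the BiLSTM output and the event-type embedding t_1 share the same dimensionality.

```python
import torch
import torch.nn.functional as F

def local_attention(h, t1):
    """h: (seq_len, dim) BiLSTM outputs; t1: (dim,) event-type embedding.
    Returns alpha_s, the per-word weights that highlight latent trigger words."""
    scores = h @ t1                  # h_k . t1^T for every word k
    return F.softmax(scores, dim=0)
```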
(4.3) For all the segmented words in each sentence, a global attention mechanism computes the weight, within the document, of the sentence containing the word, so as to obtain the unique meaning of the trigger in that scenario, assist in judging the event type of the sentence, and resolve the ambiguity of triggers caused by contextual information; the calculation formula is as follows:
α_d^k = exp(h_k · t_2^T + h_k · d^T) / Σ_j exp(h_j · t_2^T + h_j · d^T)
where h_k is the k-th part of the output vector h, α_d^k is the k-th component of the global attention vector α_d, t_2^T is the transpose of the event-type embedding vector, and d^T is the transpose of the document-level embedding vector. As shown in FIG. 2, the document embedding assists the global attention mechanism and avoids the ambiguity that local attention alone would cause.
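A corresponding sketch of the global attention is given below; the additive combination of the event-type embedding t_2 and the Doc2Vec document vector d is an assumption consistent with the definitions above.

```python
import torch
import torch.nn.functional as F

def global_attention(h, t2, d):
    """h: (seq_len, dim); t2: (dim,) event-type embedding; d: (dim,) document embedding.
    Returns alpha_d, per-word weights carrying document-level context."""
    scores = h @ t2 + h @ d          # assumed additive combination of both cues
    return F.softmax(scores, dim=0)
```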
(4.4) Local attention and global attention are fused with weights to improve event detection accuracy; the formulas are as follows:
v_s = α_s · t_1
v_d = α_d · t_2
o = σ(λ · v_s + (1 − λ) · v_d)
Here the final output value o consists of two parts, v_s and v_d. v_s is generated by the dot product of α_s and t_1 and captures local features while simulating hidden event triggers; v_d is generated by the dot product of α_d and t_2 and captures global features and contextual information. σ is the sigmoid function, and λ ∈ [0,1] is a hyperparameter that trades off between v_s and v_d.
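A sketch of the weighted fusion, reusing the two attention functions above, is given below. One reading of v_s = α_s · t_1 and v_d = α_d · t_2 is assumed here: the attention weights first pool the BiLSTM states into sentence vectors, which are then scored against the event-type embeddings.

```python
import torch

def classify_sentence(h, t1, t2, d, lam=0.5):
    """Probability that the sentence expresses the event type embedded as t1/t2."""
    alpha_s = local_attention(h, t1)          # (seq_len,)
    alpha_d = global_attention(h, t2, d)      # (seq_len,)
    v_s = (alpha_s @ h) @ t1                  # local evidence: simulated hidden trigger
    v_d = (alpha_d @ h) @ t2                  # global evidence: document-level context
    return torch.sigmoid(lam * v_s + (1.0 - lam) * v_d)
```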
Step (5): use the focal loss function to avoid the imbalance between positive and negative samples and classify all documents.
The data set is processed in units of sentences, with <sentence, event type> pairs forming the training data; each pair indicates whether the given sentence conveys an event of type t, and its label is 1 or 0. For example, the training pair <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, daily inspection> is labeled 1, while <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, earthquake> is labeled 0. Since the number of events a single sentence can express is small compared with the number of predefined events, the negative samples of the binary classification far outnumber the positive samples; the focal loss function is therefore introduced to strengthen the influence of positive samples and hard samples on the model. The calculation formula is as follows:
J(θ) = −Σ_i [ β · y^(i) · (1 − o(x^(i)))^γ · log(o(x^(i))) + (1 − β) · (1 − y^(i)) · (o(x^(i)))^γ · log(1 − o(x^(i))) ] + δ · ||θ||²
where x consists of a sentence and a target event type, y ∈ {0,1}, o(x^(i)) is the model prediction, ||θ||² is the sum of squares of the model parameters, δ > 0 is the weight of the L2 regularization term, β is a parameter balancing the weights of positive and negative samples, and γ is a parameter balancing the weights of hard and easy samples; in this experiment, β = 0.25 and γ = 2.
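A sketch of the focal loss for the <sentence, event type> binary decisions is given below, with β = 0.25 and γ = 2 as stated; the L2 term δ·||θ||² would typically be handled through the optimizer's weight decay rather than inside the loss itself.

```python
import torch

def focal_loss(o, y, beta=0.25, gamma=2.0, eps=1e-8):
    """o: predicted probabilities in (0, 1); y: gold labels in {0, 1}, same shape."""
    pos = -beta * y * (1.0 - o).pow(gamma) * torch.log(o + eps)              # up-weight positives
    neg = -(1.0 - beta) * (1.0 - y) * o.pow(gamma) * torch.log(1.0 - o + eps)
    return (pos + neg).mean()
```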
Finally, the trained model performs event detection on the dam operation log files and classifies the documents based on the event types each contains.
An emergency working-condition event detection system based on dam operation logs, which performs event detection on dam operation logs in the water conservancy field, comprises:
a log file preprocessing module, which first sorts and splits the dam operation logs according to their recording dates, numbers each document, sorts, numbers, and segments the sentences in each document, labels each word with an entity type and a part of speech, and then constructs a set of dam emergency working-condition event types;
an encoding vector embedding module, which uses the pre-trained ALBERT model to encode all segmented words in the dam operation logs and convert them into the corresponding word embedding vectors;
a BiLSTM feature fusion module, which uses a BiLSTM to fuse each segmented word's embedding vector, named-entity-type vector, and part-of-speech vector, strengthening the word's semantic information;
a dual-attention semantic enhancement module, which uses sentence-document dual attention to fuse contextual information;
and a model, which is trained with the focal loss function and then used to classify the events to which all documents belong.
Obviously, those skilled in the art should understand that the steps of the emergency working-condition event detection method based on dam operation logs, or the modules of the emergency working-condition event detection system based on dam operation logs, of the above embodiments of the present invention can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from that described here, or they can each be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202004.4A CN113901815B (en) | 2021-10-15 | 2021-10-15 | Emergency working condition event detection method based on dam operation log |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202004.4A CN113901815B (en) | 2021-10-15 | 2021-10-15 | Emergency working condition event detection method based on dam operation log |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113901815A CN113901815A (en) | 2022-01-07 |
| CN113901815B true CN113901815B (en) | 2023-05-05 |
Family
ID=79192213
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111202004.4A Active CN113901815B (en) | 2021-10-15 | 2021-10-15 | Emergency working condition event detection method based on dam operation log |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113901815B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115082898A (en) * | 2022-07-04 | 2022-09-20 | 小米汽车科技有限公司 | Obstacle detection method, obstacle detection device, vehicle, and storage medium |
| CN116738366B (en) * | 2023-06-16 | 2024-07-16 | 河海大学 | Method and system for identifying causal relationship of dam emergency event based on feature fusion |
| CN118897895B (en) * | 2024-10-09 | 2025-01-21 | 恒实建设管理股份有限公司 | Method and system for automatically generating supervision log for construction supervision |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110135457A (en) * | 2019-04-11 | 2019-08-16 | 中国科学院计算技术研究所 | Method and system for extracting event trigger words based on autoencoder fusion document information |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103049532A (en) * | 2012-12-21 | 2013-04-17 | 东莞中国科学院云计算产业技术创新与育成中心 | Knowledge base engine construction and query method based on emergency management of emergencies |
| CN111881258B (en) * | 2020-07-28 | 2023-06-20 | 广东工业大学 | Self-learning event extraction method and application thereof |
| CN112612871B (en) * | 2020-12-17 | 2023-09-15 | 浙江大学 | Multi-event detection method based on sequence generation model |
| CN112765952A (en) * | 2020-12-28 | 2021-05-07 | 大连理工大学 | Conditional probability combined event extraction method under graph convolution attention mechanism |
| CN113312500B (en) * | 2021-06-24 | 2022-05-03 | 河海大学 | Method for constructing event map for safe operation of dam |
- 2021-10-15: application CN202111202004.4A filed in CN; granted as CN113901815B (active)
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |