CN113901815B - Emergency working condition event detection method based on dam operation log - Google Patents
Emergency working condition event detection method based on dam operation log
- Publication number
- CN113901815B (application CN202111202004.4A)
- Authority
- CN
- China
- Prior art keywords
- vector
- word
- sentence
- document
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/20—Administration of product repair or maintenance
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/08—Construction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Strategic Management (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Marketing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- Game Theory and Decision Science (AREA)
- Primary Health Care (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
The present invention relates to a method for detecting emergency working-condition events based on dam operation logs. It performs event detection on dam operation logs in the field of hydraulic engineering, specifically on the various special working-condition events experienced by a dam over long-term operation and on the corresponding response events, and belongs to the technical field of natural language processing.
Background Art
The task of event detection is to identify event trigger words in large-scale unstructured natural-language text and to correctly classify the event types. Trigger words are the core words or phrases that most clearly express the occurrence of an event. Event detection is of great significance for event semantic modeling and facilitates the subsequent structured management and storage of events.
In the field of water conservancy engineering, dams provide flood control, ice-jam prevention, water storage, water supply, power generation, and other functions, and are a mainstay of China's water conservancy sector. Over decades of long-term operation, a dam encounters many natural risk events, such as floods, earthquakes, and rainstorms, which may endanger the structural safety of the dam and the lives and property of people downstream. After such a special event occurs, dam managers therefore arrange a comprehensive special inspection to maintain the dam structure. In addition, routine inspection and maintenance are important measures for ensuring the safety of the dam body. After the various response measures, inspection personnel record the cause and the result of the inspection in writing, forming a dam operation log file.
By processing the dam operation logs, the safety history of the dam since its construction can be analyzed and a dam event knowledge base can be built, improving the level of intelligent dam management. An emergency working-condition event detection method for dam operation logs can bypass event triggers, detect all predefined events in the logs, and classify the event type of each document, providing a basis for subsequent event extraction, event graph construction, and event knowledge base construction.
Chinese text contains a great deal of ambiguity, and an event generally consists of an event trigger and event arguments. Event triggers are mostly verbs and commonly suffer from polysemy and from mismatch between triggers and words, so event detection methods centered on trigger identification are prone to classification errors.
Summary of the Invention
Purpose of the invention: in view of the problems in the prior art, and of the lack of standardized records of the various natural events and response-measure events encountered during dam operation, the present invention provides an emergency working-condition event detection method based on dam operation logs. The method avoids the trigger-identification step by simulating triggers within sentences, detects special dam working-condition events from the dam operation logs, and classifies the event type of each document, providing a basis for subsequent event extraction.
Technical solution: a method for detecting emergency working-condition events based on dam operation logs, comprising the following steps:
(1) Log file preprocessing: the dam operation logs are first sorted and split according to their recording dates, each document is numbered, the sentences in each document are sorted, numbered, and segmented into words, each word is labeled with an entity type and a part of speech, and a set of dam emergency working-condition event types is then constructed; sorting refers to ordering logs of different dates, and splitting refers to dividing logs of the same day according to document content;
(2) Encoding vector embedding: the pre-trained ALBERT model is used to encode all segmented words in the dam operation logs and convert them into the corresponding word embedding vectors;
(3) BiLSTM feature fusion: a BiLSTM fuses each segmented word's embedding vector, named-entity-type vector, and part-of-speech vector, strengthening the word's semantic information;
(4) Dual-attention semantic enhancement: sentence-document dual attention fuses contextual information; sentence-level attention up-weights the words in each sentence that may trigger an event, and document-level attention up-weights the sentences in each log document that may trigger an event, addressing the polysemy and word-trigger mismatch problems of traditional Chinese event detection;
(5) Training the model with the focal loss function and performing classification: because each sentence in an ordinary dam log document contains at most two events, the positive and negative samples of the binary classification are imbalanced; the focal loss function is therefore used to train the model and classify the events to which all documents belong.
The set of dam emergency working-condition event types includes typical events such as earthquake, rainstorm, flood discharge, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection.
The named-entity types include person name, department, location, time, date, measured value, percentage, defect type, and the like; the part-of-speech tags include noun, verb, adjective, quantifier, pronoun, and the like.
Furthermore, step (1) includes the following steps:
(1.1) The dam operation log file is first divided into multiple documents according to the log recording date; each document is sorted and numbered, the sentences in each document are sorted and numbered, and jieba is used to segment each sentence into words;
(1.2) Entity-type labeling and part-of-speech tagging are performed on the segmentation results; the entity-type labels are converted into low-dimensional vectors by looking up a randomly initialized embedding table, Stanford CoreNLP is used to tag the part of speech of each word, and the part-of-speech tags are then converted into low-dimensional vectors by looking up the corresponding embedding table;
(1.3) The dam emergency working-condition event types are predefined, including typical events such as earthquake, rainstorm, flood discharge, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection.
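For illustration, the label inventories above can be held as simple lookup tables. The following is a minimal Python sketch; the English identifiers are illustrative stand-ins for the Chinese labels used in the logs, and the concrete sets on a real dam-log corpus may differ.

```python
# Illustrative label inventories derived from the lists above.
EVENT_TYPES = ["earthquake", "rainstorm", "flood_discharge", "pre_flood_safety_inspection",
               "comprehensive_special_inspection", "routine_maintenance", "daily_inspection"]
ENTITY_TYPES = ["person", "department", "location", "time", "date",
                "measured_value", "percentage", "defect_type", "other"]
POS_TAGS = ["noun", "verb", "adjective", "quantifier", "pronoun", "other"]

EVENT2ID = {t: i for i, t in enumerate(EVENT_TYPES)}
ENTITY2ID = {t: i for i, t in enumerate(ENTITY_TYPES)}
POS2ID = {t: i for i, t in enumerate(POS_TAGS)}
```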
Furthermore, step (2) includes the following steps:
The pre-trained ALBERT model is used to encode all the segmented words from (1.1) and convert them into vector representations that a computer can process.
Furthermore, step (3) includes the following steps:
(3.1) The embedding vector, entity-type vector, and part-of-speech vector of each word are concatenated, where the embedding vector is the vector obtained in step (2), the entity-type vector is the vector corresponding to the word's named-entity recognition result (such as person name, organization, location, time, date, numeric value, or percentage), and the part-of-speech vector is the vector corresponding to the word's part-of-speech tag (such as noun, verb, adjective, quantifier, or pronoun);
(3.2) A BiLSTM model processes the concatenated vectors of a single sentence, taking each vector as one input; the bidirectional LSTM units capture the context of each word and output a forward hidden state and a backward hidden state, which are concatenated into the output vector h_k.
Furthermore, step (4) includes the following steps:
(4.1) In the training set, the predefined emergency working-condition event contained in each sentence is converted into an embedding vector t_1 by looking up a randomly initialized embedding table, and each document is converted into an embedding vector d using Doc2Vec;
(4.2) For all the segmented words in each sentence, a local attention mechanism computes the weight of each word within the sentence, raising the attention weight of words that trigger the target event type and thereby simulating a trigger; the calculation formula is as follows:
α_s^k = exp(h_k · t_1^T) / Σ_j exp(h_j · t_1^T)
where h_k is the k-th part of the output vector h, α_s^k is the k-th component of the local attention vector α_s, and t_1^T is the transpose of the event-type embedding vector; the trigger refers to an event trigger, i.e., the word (generally a verb) that triggers an event;
(4.3) For all the segmented words in each sentence, a global attention mechanism computes the weight, within its document, of the sentence containing the word, so as to obtain the unique meaning of the trigger in that scenario, assist in judging the event type of the sentence, and resolve the ambiguity of triggers caused by contextual information; the calculation formula is as follows:
α_d^k = exp(h_k · t_2^T + h_k · d^T) / Σ_j exp(h_j · t_2^T + h_j · d^T)
where h_k is the k-th part of the output vector h, α_d^k is the k-th component of the global attention vector α_d, t_2^T is the transpose of the event-type embedding vector, and d^T is the transpose of the document-level embedding vector;
(4.4) Local attention and global attention are fused with weights to improve event detection accuracy; the weight vectors of local attention and global attention with respect to the event, and the weighted fusion, are computed as follows:
v_s = α_s · t_1
v_d = α_d · t_2
o = σ(λ · v_s + (1 − λ) · v_d)
Here the final output value o consists of two parts, v_s and v_d. v_s is generated by the dot product of α_s and the event-type embedding vector t_1 and captures local features while simulating hidden event triggers; v_d is generated by the dot product of α_d and t_2 and captures global features and contextual information. σ is the sigmoid function, and λ ∈ [0,1] is a hyperparameter that trades off between v_s and v_d.
Furthermore, step (5) includes the following steps:
The data set is processed in units of sentences, with <sentence, event type> pairs forming the training data; each pair indicates whether the given sentence conveys an event of type t, and the event-type label is 1 or 0. For example, the training pair <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, daily inspection> is labeled 1, while <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, earthquake> is labeled 0. Since the number of events a single sentence can express is small compared with the number of predefined events, the negative samples of the binary classification far outnumber the positive samples; to address this imbalance, the model is trained with the focal loss function, strengthening the influence of positive samples and hard samples on the model. The calculation formula is as follows:
J(θ) = −Σ_i [ β · y^(i) · (1 − o(x^(i)))^γ · log(o(x^(i))) + (1 − β) · (1 − y^(i)) · (o(x^(i)))^γ · log(1 − o(x^(i))) ] + δ · ||θ||²
where x consists of a sentence and a target event type, y ∈ {0,1}, o(x^(i)) is the model prediction, ||θ||² is the sum of squares of the model parameters, δ > 0 is the weight of the L2 regularization term, β is a parameter balancing the weights of positive and negative samples, and γ is a parameter balancing the weights of hard and easy samples; in this experiment, β = 0.25 and γ = 2.
Finally, the trained model performs event detection on the dam operation log files and classifies each document based on the event types it contains.
An emergency working-condition event detection system based on dam operation logs, which performs event detection on dam operation logs in the water conservancy field, comprising:
a log file preprocessing module, which first sorts and splits the dam operation logs according to their recording dates, numbers each document, sorts, numbers, and segments the sentences in each document, labels each word with an entity type and a part of speech, and then constructs a set of dam emergency working-condition event types;
an encoding vector embedding module, which uses the pre-trained ALBERT model to encode all segmented words in the dam operation logs and convert them into the corresponding word embedding vectors;
a BiLSTM feature fusion module, which uses a BiLSTM to fuse each segmented word's embedding vector, named-entity-type vector, and part-of-speech vector, strengthening the word's semantic information;
a dual-attention semantic enhancement module, which uses sentence-document dual attention to fuse contextual information;
and a model, which is trained with the focal loss function and then used to classify the events to which all documents belong.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the emergency working-condition event detection method based on dam operation logs described above.
A computer-readable storage medium storing a computer program for executing the emergency working-condition event detection method based on dam operation logs described above.
Beneficial effects: compared with the prior art, the emergency working-condition event detection method based on dam operation logs provided by the present invention captures keywords and sentence-level semantic information through local attention and simulates hidden event triggers, realizing event detection without triggers; it introduces rich document-level contextual information through global attention to help judge the meaning of a word in its actual context, skipping the trigger-identification step and directly judging the event type. This avoids the problems of mismatch between Chinese words and triggers and of polysemy, and improves event detection accuracy.
Brief Description of the Drawings
FIG. 1 is a flow chart of model training according to an embodiment of the present invention;
FIG. 2 is a diagram of the model training framework according to an embodiment of the present invention.
Detailed Description
The present invention is further explained below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the present invention and not to limit its scope; after reading the present disclosure, modifications of various equivalent forms made by those skilled in the art shall fall within the scope defined by the claims appended to this application.
As shown in FIG. 1, the emergency working-condition event detection method based on dam operation logs mainly includes the following steps:
Step (1): preprocess the dam operation log files.
(1.1) The dam operation log file is first divided into multiple documents according to the log recording date to serve as the training set; each document is sorted and numbered, the sentences in each document are sorted and numbered, and jieba is used to segment each sentence into words. As shown in FIG. 2, the sentence "近坝库岸、枢纽区边坡及公路检查情况：无异常" ("Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality") is first split into "近坝", "库岸", "、", "枢纽区", "边坡", "及", "公路", "检查", "情况", "：", "无", "异常".
(1.2) Entity-type labeling and part-of-speech tagging are performed on the segmentation results; the entity-type labels are converted into low-dimensional vectors by looking up a randomly initialized embedding table, Stanford CoreNLP is used to tag the part of speech of each word, and the tags are then converted into low-dimensional vectors by looking up an embedding table.
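The following is a minimal sketch of steps (1.1)–(1.2), assuming jieba for word segmentation. jieba's bundled part-of-speech tagger is used here only as a stand-in for the Stanford CoreNLP tagger named above, the sentence split on "。" is a simplification, and `tag_entities` is a hypothetical placeholder for whichever Chinese NER tool is actually used.

```python
import jieba.posseg as pseg  # stand-in for Stanford CoreNLP POS tagging

def tag_entities(words):
    # Hypothetical placeholder: a real implementation would call a Chinese NER
    # model covering names, departments, locations, dates, measured values, etc.
    return ["other"] * len(words)

def preprocess_document(doc_id, doc_text):
    """Split one dam-log document into numbered, segmented, and tagged sentences."""
    sentences = [s for s in doc_text.replace("；", "。").split("。") if s.strip()]
    processed = []
    for sent_id, sent in enumerate(sentences):
        pairs = pseg.lcut(sent)                # [(word, pos_flag), ...]
        words = [w for w, _ in pairs]
        processed.append({
            "doc_id": doc_id,
            "sent_id": sent_id,
            "words": words,
            "pos": [flag for _, flag in pairs],
            "ner": tag_entities(words),
        })
    return processed
```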
(1.3) The dam emergency working-condition event types are predefined for the dam operation logs, including typical events such as earthquake, rainstorm, flood discharge, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection.
Step (2): encode the segmented words into word vectors.
The pre-trained ALBERT model is used to encode all the segmented words from (1.1) and convert them into vector representations that a computer can process.
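A sketch of this step using the HuggingFace `transformers` API is given below. The checkpoint path is a placeholder for any pre-trained Chinese ALBERT model (a fast tokenizer is assumed), and sub-token vectors are mean-pooled back onto the jieba word segments.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; substitute any pre-trained Chinese ALBERT model here.
tokenizer = AutoTokenizer.from_pretrained("path/to/chinese-albert")
albert = AutoModel.from_pretrained("path/to/chinese-albert")

def encode_words(words):
    """Return one embedding per segmented word by mean-pooling its sub-tokens."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = albert(**enc).last_hidden_state[0]   # (num_subtokens, dim)
    word_ids = enc.word_ids(0)                        # maps sub-tokens back to words
    vectors = []
    for idx in range(len(words)):
        pieces = [i for i, w in enumerate(word_ids) if w == idx]
        vectors.append(hidden[pieces].mean(dim=0))
    return torch.stack(vectors)                       # (num_words, dim)
```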
Step (3): concatenate the word vectors, named-entity-type vectors, and part-of-speech vectors, then extract semantic information.
(3.1) The embedding vector, entity-type vector, and part-of-speech vector of each segmented word are concatenated, where the embedding vector is the vector obtained in step (2), the entity-type vector is the vector corresponding to the word's named-entity recognition result (such as person name, organization, location, time, date, measured value, percentage, or defect type), and the part-of-speech vector is the vector corresponding to the word's part-of-speech tag (such as noun, verb, adjective, quantifier, or pronoun).
(3.2) A BiLSTM model processes the concatenated vectors of a single sentence, taking each vector as one input; the bidirectional LSTM units capture the context of each word and output a forward hidden state and a backward hidden state, which are concatenated into the output vector h_k.
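A PyTorch sketch of this fusion step is given below; the embedding dimensions and vocabulary sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusionBiLSTM(nn.Module):
    """Concatenate word, entity-type, and POS vectors, then run a BiLSTM over the sentence."""
    def __init__(self, word_dim=768, ent_dim=25, pos_dim=25,
                 num_entity_types=9, num_pos_tags=6, hidden=150):
        super().__init__()
        self.ent_emb = nn.Embedding(num_entity_types, ent_dim)  # randomly initialised tables
        self.pos_emb = nn.Embedding(num_pos_tags, pos_dim)
        self.bilstm = nn.LSTM(word_dim + ent_dim + pos_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vecs, ent_ids, pos_ids):
        # word_vecs: (batch, seq_len, word_dim); ent_ids, pos_ids: (batch, seq_len)
        x = torch.cat([word_vecs, self.ent_emb(ent_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.bilstm(x)  # (batch, seq_len, 2*hidden): forward/backward states concatenated
        return h
```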
Step (4): use the dual attention mechanism to capture sentence-level and document-level context, strengthen the word vector representations, and simulate triggers.
(4.1) In the training set, the event contained in each sentence is converted into an embedding vector t_1 by looking up a randomly initialized embedding table, and each document is converted into an embedding vector d using Doc2Vec.
(4.2) For all the segmented words in each sentence, a local attention mechanism computes the weight of each word within the sentence, raising the attention weight of words that trigger the target event type and thereby simulating a trigger; the calculation formula is as follows:
α_s^k = exp(h_k · t_1^T) / Σ_j exp(h_j · t_1^T)
where h_k is the k-th part of the output vector h, α_s^k is the k-th component of the local attention vector α_s, and t_1^T is the transpose of the event-type embedding vector. As shown in FIG. 2, t_1 assists the local attention mechanism in simulating a trigger for each segmented word.
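A sketch of this local attention, written directly from the softmax form above, is given below; it assumes the BiLSTM output and the event-type embedding t_1 share the same dimensionality.

```python
import torch
import torch.nn.functional as F

def local_attention(h, t1):
    """h: (seq_len, dim) BiLSTM outputs; t1: (dim,) event-type embedding.
    Returns alpha_s, the per-word weights that highlight latent trigger words."""
    scores = h @ t1                  # h_k . t1^T for every word k
    return F.softmax(scores, dim=0)
```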
(4.3) For all the segmented words in each sentence, a global attention mechanism computes the weight, within the document, of the sentence containing the word, so as to obtain the unique meaning of the trigger in that scenario, assist in judging the event type of the sentence, and resolve the ambiguity of triggers caused by contextual information; the calculation formula is as follows:
α_d^k = exp(h_k · t_2^T + h_k · d^T) / Σ_j exp(h_j · t_2^T + h_j · d^T)
where h_k is the k-th part of the output vector h, α_d^k is the k-th component of the global attention vector α_d, t_2^T is the transpose of the event-type embedding vector, and d^T is the transpose of the document-level embedding vector. As shown in FIG. 2, the document embedding assists the global attention mechanism and avoids the ambiguity that local attention alone would cause.
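A corresponding sketch of the global attention is given below; the additive combination of the event-type embedding t_2 and the Doc2Vec document vector d is an assumption consistent with the definitions above.

```python
import torch
import torch.nn.functional as F

def global_attention(h, t2, d):
    """h: (seq_len, dim); t2: (dim,) event-type embedding; d: (dim,) document embedding.
    Returns alpha_d, per-word weights carrying document-level context."""
    scores = h @ t2 + h @ d          # assumed additive combination of both cues
    return F.softmax(scores, dim=0)
```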
(4.4) Local attention and global attention are fused with weights to improve event detection accuracy; the formulas are as follows:
v_s = α_s · t_1
v_d = α_d · t_2
o = σ(λ · v_s + (1 − λ) · v_d)
Here the final output value o consists of two parts, v_s and v_d. v_s is generated by the dot product of α_s and t_1 and captures local features while simulating hidden event triggers; v_d is generated by the dot product of α_d and t_2 and captures global features and contextual information. σ is the sigmoid function, and λ ∈ [0,1] is a hyperparameter that trades off between v_s and v_d.
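A sketch of the weighted fusion, reusing the two attention functions above, is given below. One reading of v_s = α_s · t_1 and v_d = α_d · t_2 is assumed here: the attention weights first pool the BiLSTM states into sentence vectors, which are then scored against the event-type embeddings.

```python
import torch

def classify_sentence(h, t1, t2, d, lam=0.5):
    """Probability that the sentence expresses the event type embedded as t1/t2."""
    alpha_s = local_attention(h, t1)          # (seq_len,)
    alpha_d = global_attention(h, t2, d)      # (seq_len,)
    v_s = (alpha_s @ h) @ t1                  # local evidence: simulated hidden trigger
    v_d = (alpha_d @ h) @ t2                  # global evidence: document-level context
    return torch.sigmoid(lam * v_s + (1.0 - lam) * v_d)
```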
Step (5): use the focal loss function to avoid the imbalance between positive and negative samples and classify all documents.
The data set is processed in units of sentences, with <sentence, event type> pairs forming the training data; each pair indicates whether the given sentence conveys an event of type t, and its label is 1 or 0. For example, the training pair <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, daily inspection> is labeled 1, while <Inspection of the reservoir bank near the dam, the slopes of the hub area, and the roads: no abnormality, earthquake> is labeled 0. Since the number of events a single sentence can express is small compared with the number of predefined events, the negative samples of the binary classification far outnumber the positive samples; the focal loss function is therefore introduced to strengthen the influence of positive samples and hard samples on the model. The calculation formula is as follows:
J(θ) = −Σ_i [ β · y^(i) · (1 − o(x^(i)))^γ · log(o(x^(i))) + (1 − β) · (1 − y^(i)) · (o(x^(i)))^γ · log(1 − o(x^(i))) ] + δ · ||θ||²
where x consists of a sentence and a target event type, y ∈ {0,1}, o(x^(i)) is the model prediction, ||θ||² is the sum of squares of the model parameters, δ > 0 is the weight of the L2 regularization term, β is a parameter balancing the weights of positive and negative samples, and γ is a parameter balancing the weights of hard and easy samples; in this experiment, β = 0.25 and γ = 2.
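A sketch of the focal loss for the <sentence, event type> binary decisions is given below, with β = 0.25 and γ = 2 as stated; the L2 term δ·||θ||² would typically be handled through the optimizer's weight decay rather than inside the loss itself.

```python
import torch

def focal_loss(o, y, beta=0.25, gamma=2.0, eps=1e-8):
    """o: predicted probabilities in (0, 1); y: gold labels in {0, 1}, same shape."""
    pos = -beta * y * (1.0 - o).pow(gamma) * torch.log(o + eps)              # up-weight positives
    neg = -(1.0 - beta) * (1.0 - y) * o.pow(gamma) * torch.log(1.0 - o + eps)
    return (pos + neg).mean()
```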
Finally, the trained model performs event detection on the dam operation log files and classifies the documents based on the event types each contains.
An emergency working-condition event detection system based on dam operation logs, which performs event detection on dam operation logs in the water conservancy field, comprises:
a log file preprocessing module, which first sorts and splits the dam operation logs according to their recording dates, numbers each document, sorts, numbers, and segments the sentences in each document, labels each word with an entity type and a part of speech, and then constructs a set of dam emergency working-condition event types;
an encoding vector embedding module, which uses the pre-trained ALBERT model to encode all segmented words in the dam operation logs and convert them into the corresponding word embedding vectors;
a BiLSTM feature fusion module, which uses a BiLSTM to fuse each segmented word's embedding vector, named-entity-type vector, and part-of-speech vector, strengthening the word's semantic information;
a dual-attention semantic enhancement module, which uses sentence-document dual attention to fuse contextual information;
and a model, which is trained with the focal loss function and then used to classify the events to which all documents belong.
Obviously, those skilled in the art should understand that the steps of the emergency working-condition event detection method based on dam operation logs, or the modules of the emergency working-condition event detection system based on dam operation logs, of the above embodiments of the present invention can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from that described here, or they can each be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202004.4A CN113901815B (en) | 2021-10-15 | 2021-10-15 | Emergency working condition event detection method based on dam operation log |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202004.4A CN113901815B (en) | 2021-10-15 | 2021-10-15 | Emergency working condition event detection method based on dam operation log |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113901815A CN113901815A (en) | 2022-01-07 |
| CN113901815B true CN113901815B (en) | 2023-05-05 |
Family
ID=79192213
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111202004.4A Active CN113901815B (en) | 2021-10-15 | 2021-10-15 | Emergency working condition event detection method based on dam operation log |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113901815B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115082898A (en) * | 2022-07-04 | 2022-09-20 | 小米汽车科技有限公司 | Obstacle detection method, obstacle detection device, vehicle, and storage medium |
| CN116738366B (en) * | 2023-06-16 | 2024-07-16 | 河海大学 | Method and system for identifying causal relationship of dam emergency event based on feature fusion |
| CN118897895B (en) * | 2024-10-09 | 2025-01-21 | 恒实建设管理股份有限公司 | Method and system for automatically generating supervision log for construction supervision |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110135457A (en) * | 2019-04-11 | 2019-08-16 | 中国科学院计算技术研究所 | Method and system for extracting event trigger words based on autoencoder fusion document information |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103049532A (en) * | 2012-12-21 | 2013-04-17 | 东莞中国科学院云计算产业技术创新与育成中心 | Knowledge base engine construction and query method based on emergency management of emergencies |
| CN111881258B (en) * | 2020-07-28 | 2023-06-20 | 广东工业大学 | Self-learning event extraction method and application thereof |
| CN112612871B (en) * | 2020-12-17 | 2023-09-15 | 浙江大学 | Multi-event detection method based on sequence generation model |
| CN112765952A (en) * | 2020-12-28 | 2021-05-07 | 大连理工大学 | Conditional probability combined event extraction method under graph convolution attention mechanism |
| CN113312500B (en) * | 2021-06-24 | 2022-05-03 | 河海大学 | Method for constructing event map for safe operation of dam |
- 2021-10-15: application CN202111202004.4A filed in CN; granted as CN113901815B (active)
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |