
CN116991983B - Event extraction method and system for company information text - Google Patents

Event extraction method and system for company information text

Info

Publication number
CN116991983B
Authority
CN
China
Prior art keywords
company name
text
company
event
information text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311259460.1A
Other languages
Chinese (zh)
Other versions
CN116991983A (en
Inventor
李栓
王笑
朱健平
那崇宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311259460.1A
Publication of CN116991983A
Application granted
Publication of CN116991983B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an event extraction method and system for company information text. For the event extraction task on company information text, a new labeling rule is proposed to address the interference of noise company names with model performance: noise company names are incorporated into the entity recognition labeling scheme, and event categories are assigned to them. The task, which previously required the model both to extract company name fields in the presence of noise and to judge the event types corresponding to each company name, is thereby converted into a simple classification problem, which greatly relieves the burden on the model and reduces the difficulty of the task. A two-stage extraction model for company names and event types is also constructed, which improves the accuracy with which the model extracts company name fields and judges the event types corresponding to the company names.

Description

Event extraction method and system for company information text
Technical Field
The invention relates to the intersection of natural language processing and finance, and in particular to an event extraction method and an event extraction system for company information text.
Background
The task of event extraction for company information text is to extract, from a given information text, what happened (the event type) to which company (the event subject). However, noise company names are often present in the given text; that is, a company name field is merely mentioned or appears in the text, and nothing happens to it. In the labeling schemes commonly used for this task, such company names are not labeled, and the model structures designed for the task are often affected by them. At present, the model structures for the task fall mainly into two types:
1. Two-stage extraction: the company name fields in the text are extracted first, and what happened to each company name in the text is then judged. When extracting company name fields in this mode, the model must both extract each field accurately and judge from the context whether an event type defined in the labeling scheme occurs for it. As a result, the accuracy of recognition and extraction is low, and satisfactory application performance cannot be achieved, particularly when few samples are available.
2. Joint extraction: the company name fields in the text are extracted and the event types occurring for the company names in the given text are judged at the same time. To a certain extent, the event type judgment provides information to the company name extraction task, which helps the model judge whether a defined event type occurs in the given text for the company name field to be extracted. However, this model structure does not solve the problem of company name noise at its source, and the noise still interferes substantially with the model.
Therefore, a technical problem to be solved is how to mitigate the interference with model performance caused by noise company name fields, that is, company name fields for which no defined event type occurs in the given text.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide an event extraction method and an event extraction system for company information text.
The technical solution adopted to solve the technical problem is as follows:
An event extraction method for company information text comprises the following steps:
(1) Acquiring company-oriented information texts and constructing a corpus of the information texts; cleaning and preprocessing the information texts in the corpus;
(2) Labeling the cleaned information texts according to preset rules; performing text vectorization and label digitization on the labeled information texts;
(3) Constructing and training a two-stage event extraction model for company names and event types, and extracting company names and their corresponding event types with the trained model;
(4) Screening and outputting the extracted company names and their corresponding event types.
Specifically, cleaning and preprocessing the information texts in the corpus in step (1) comprises performing the following operations in sequence: unifying the case of English letters, unifying Chinese and English punctuation marks, converting Traditional Chinese to Simplified Chinese, and deleting garbled and unprintable characters.
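The four cleaning operations can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `clean_text`, the choice of lower-casing, the use of NFKC normalization for punctuation, and the two small translation tables are all assumptions made here (a real system would need a full Traditional-to-Simplified dictionary).

```python
import unicodedata

# Hypothetical, deliberately tiny tables for illustration only.
PUNCT = str.maketrans({"。": ".", "“": '"', "”": '"', "‘": "'", "’": "'"})
T2S = str.maketrans({"資": "资", "訊": "讯", "總": "总", "辭": "辞"})

def clean_text(text: str) -> str:
    # 1. Unify the case of English letters (lower-casing is an assumed convention).
    text = text.lower()
    # 2. Unify Chinese and English punctuation: NFKC folds full-width forms
    #    into ASCII; PUNCT covers marks that NFKC leaves untouched.
    text = unicodedata.normalize("NFKC", text).translate(PUNCT)
    # 3. Convert Traditional Chinese characters to Simplified Chinese.
    text = text.translate(T2S)
    # 4. Delete garbled and unprintable characters: the replacement character
    #    U+FFFD and everything in Unicode category C (this also drops
    #    newlines and tabs, so it assumes single-line texts).
    return "".join(
        ch for ch in text
        if ch != "\ufffd" and unicodedata.category(ch)[0] != "C"
    )
```

Applying the steps in a fixed order matters: here case folding runs before NFKC so that full-width Latin letters are lower-cased and then folded to ASCII in one pass.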
Further, labeling the cleaned information text according to preset rules in step (2) comprises the following sub-steps:
(2.1) Labeling the fields of all company names and their abbreviations in the information text, $[com_1, com_2, com_3, \dots]$;
(2.2) Labeling, according to the preset event types $[EventType_1, EventType_2, EventType_3, \dots, EventType_n, None, Out]$, all event types that occur for each company name field in the given information text, where $[EventType_1, EventType_2, EventType_3, \dots, EventType_n]$ denotes the event types to be extracted, $n$ is the total number of such event types, None indicates that nothing happened to the company name field in the given information text, and Out indicates that an event type other than those to be extracted occurred for the company name field in the given information text.
Further, labeling the fields of all company names and their abbreviations in the information text in step (2.1) specifically comprises:
(2.1.1) Acquiring an open-source dataset with strongly labeled company names, namely the CLUENER fine-grained named entity recognition dataset, and separately screening out the samples in the dataset that contain company name labels; strongly labeled means that the labeling accuracy on the samples exceeds 98%;
(2.1.2) Constructing and training a BERT+Softmax company name entity extraction model, and automatically labeling the information texts with the trained company name entity extraction model;
(2.1.3) Acquiring an open-source company name lexicon, and continuing to label company names in the information texts using a forward matching algorithm and the open-source company name lexicon;
(2.1.4) Finally, performing manual verification: checking and correcting wrongly labeled company name fields, and supplementing labels for company name fields that were not labeled.
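The lexicon-based labeling in step (2.1.3) can be illustrated with forward maximum matching, one common form of forward matching. The text does not say which variant is used, so the longest-match-first rule, the function name `forward_match`, and the company names below are assumptions for illustration.

```python
def forward_match(text: str, lexicon: set[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, name) spans found by forward maximum matching."""
    max_len = max((len(word) for word in lexicon), default=0)
    spans = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                spans.append((i, i + size, candidate))
                i += size  # continue after the matched name
                break
        else:
            i += 1  # no lexicon entry starts here; move one character on
    return spans


# Hypothetical lexicon and text.
lexicon = {"甲公司", "甲公司集团", "乙股份"}
spans = forward_match("甲公司集团收购乙股份", lexicon)
```

Preferring the longest entry keeps a full name such as "甲公司集团" from being split into its shorter prefix "甲公司"; spans produced this way would still go through the manual check in step (2.1.4).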
Further, performing text vectorization and label digitization on the labeled information text in step (2) specifically comprises: vectorizing the input information text $T_i$ to obtain $X_i = [x_{i1}, x_{i2}, x_{i3}, \dots]$; encoding the positions of the company names in the text with the BIO coding rule to obtain $Tag_i$; masking the position of each labeled company name in the information text with the number 1 to generate, for the company names $[com_{i1}, com_{i2}, com_{i3}, \dots, com_{ik}]$, the mask vectors $M_i = [m_{i1}, m_{i2}, m_{i3}, \dots, m_{ik}]$ relative to the information text $T_i$; and generating the event category labels $Lab_i = [lab_{i1}, lab_{i2}, lab_{i3}, \dots, lab_{ik}]$ corresponding to each event subject, where $k$ is the total number of company name fields in the information text $T_i$, and each company name field $com_{ij}$ has a corresponding mask vector $m_{ij}$ and event category label $lab_{ij}$.
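A small sketch of the label digitization may help. It assumes token-level spans with exclusive end indices, the tag names `B-COM`/`I-COM`, a multi-hot event label, and a hypothetical event list; none of these specifics are given in the text.

```python
# Hypothetical label set; the order of the entries is an assumption.
EVENT_TYPES = ["None", "ExecutiveChange", "ShareholderReduction", "Out"]

def encode_sample(tokens, spans, span_events):
    """Build BIO tags, one mask vector per company name, and multi-hot labels.

    tokens: list of tokens of the text T_i
    spans: (start, end) token ranges of company names, end exclusive
    span_events: for each span, the list of event-type names labeled for it
    """
    tags = ["O"] * len(tokens)
    masks, labels = [], []
    for (start, end), events in zip(spans, span_events):
        tags[start] = "B-COM"
        for pos in range(start + 1, end):
            tags[pos] = "I-COM"
        # Mask m_ij: 1 at the positions of company name j, 0 elsewhere.
        masks.append([1 if start <= pos < end else 0 for pos in range(len(tokens))])
        # Label lab_ij: one entry per preset event type.
        labels.append([1 if name in events else 0 for name in EVENT_TYPES])
    return tags, masks, labels


tokens = ["[CLS]", "甲", "公", "司", "公", "告", "[SEP]"]
tags, masks, labels = encode_sample(tokens, [(1, 4)], [["None"]])
```

Each mask has the same length as the token sequence, so it can later select the token representations that belong to one company name.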
Further, constructing and training the two-stage event extraction model for company names and event types in step (3) is specifically as follows. The vectorized representation $X_i$ of the information text is input into a pre-trained model $BERT_1$ to obtain the semantic representation $X_{embed,i,1}$ of the information text; the semantic representation is passed in turn through a single-layer linear function Linear and the normalized exponential function Softmax to obtain the predicted probability $P_{tag,i}$ of whether each character in the information text belongs to a company name field; the loss value $loss_{com}$ for fitting company name fields is computed with the cross-entropy function CrossEntropy; and a trained company name prediction model is obtained after back propagation and parameter optimization. The vectorized representation $X_i$ of the information text is also input into a pre-trained model $BERT_2$ to obtain the semantic representation $X_{embed,i,2}$; the mask vectors $m_{ij}$ of each information text are traversed, and each mask vector $m_{ij}$ is used to select the representation vectors $X_{embed,ij}$ corresponding to company name $j$ in text $i$, which are passed in turn through a pooling function Avgpool, a single-layer linear function Linear, and the logistic function Sigmoid to obtain the probability distribution $p_{type,ij}$ over the events occurring for company name $j$ in text $i$; the loss value $loss_{type}$ for predicting event types is computed with the binary cross-entropy loss function BCELoss; and a trained event type prediction model is obtained after back propagation and model parameter optimization.
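The two stages can be summarized in formulas. The following is one reading of the description rather than a statement from the patent: the pooled vector $\bar{X}_{embed,ij}$, the position index $t$, the category index $c$, the number of categories $C$, and the mean reduction in the binary cross-entropy are notation and assumptions introduced here.

```latex
\begin{aligned}
X_{embed,i,1} &= BERT_1(X_i) \\
P_{tag,i} &= \mathrm{Softmax}\big(\mathrm{Linear}(X_{embed,i,1})\big) \\
loss_{com} &= \mathrm{CrossEntropy}\big(P_{tag,i},\, Tag_i\big) \\[4pt]
X_{embed,i,2} &= BERT_2(X_i) \\
\bar{X}_{embed,ij} &= \mathrm{Avgpool}\big(X_{embed,ij}\big)
  = \frac{\sum_{t} m_{ij}[t]\, X_{embed,i,2}[t]}{\sum_{t} m_{ij}[t]} \\
p_{type,ij} &= \mathrm{Sigmoid}\big(\mathrm{Linear}(\bar{X}_{embed,ij})\big) \\
loss_{type} &= \mathrm{BCELoss}\big(p_{type,ij},\, lab_{ij}\big)
  = -\frac{1}{C} \sum_{c=1}^{C} \Big[ lab_{ij}[c] \log p_{type,ij}[c]
    + \big(1 - lab_{ij}[c]\big) \log\big(1 - p_{type,ij}[c]\big) \Big]
\end{aligned}
```

Because Sigmoid is applied per category rather than Softmax across categories, a single company name can receive several event types at once, which matches the multi-label form of $lab_{ij}$.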
Further, extracting the company names and corresponding event types with the trained model in step (3) is specifically as follows. Using the constructed and trained two-stage event extraction model for company names and event types, the probability $P_{tag,i}$ of whether each character in the input information text belongs to a company name field is obtained, and the company name fields $[com_{i1}, com_{i2}, com_{i3}, \dots, com_{ik}]$ in the input information text are extracted; the position of each extracted company name in the information text is masked with the number 1 to generate the mask vectors $M_i = [m_{i1}, m_{i2}, m_{i3}, \dots, m_{ik}]$ of the company names relative to the information text $T_i$; the same model is then used to obtain the probability distribution $p_{type,ij}$ over the events occurring for each company name field in the information text, and the event types of each company name field in the input information text are extracted; if the extracted event types of a company name field are empty, that is, the probability of every category in $p_{type,ij}$ is less than 0.5, the event type with the largest probability is selected as the predicted event type.
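The selection rule for event types can be sketched as below. It assumes that an event type is extracted when its probability is at least 0.5, which the fallback condition in the text implies but does not state outright; the function name and the event names are hypothetical.

```python
def select_event_types(probs: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Pick the event types for one company name from p_type,ij."""
    chosen = [name for name, p in probs.items() if p >= threshold]
    if not chosen:
        # Every category is below the threshold: fall back to the most
        # probable one so each company name gets at least one event type.
        chosen = [max(probs, key=probs.get)]
    return chosen


multi = select_event_types({"ExecutiveChange": 0.7, "ShareholderReduction": 0.6, "None": 0.1})
fallback = select_event_types({"ExecutiveChange": 0.2, "None": 0.4, "Out": 0.3})
```

The fallback guarantees a non-empty prediction, so a noise company name ends up classified as None or Out instead of being left without a label.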
Further, screening and outputting the company names and event types extracted by the model in step (4) specifically comprises: judging whether the event types corresponding to a company name contain Out or None and, if so, deleting those event types; if event types remain for the company name after the deletion, outputting the company name and its corresponding event types; otherwise, deleting the company name together with its corresponding event types.
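A minimal sketch of the screening step follows, under the reading that Out and None are removed first and a company name is kept only if some event type remains. The function name `screen_results`, the dictionary layout, and the sample names are assumptions.

```python
def screen_results(results: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop the None/Out event types, then drop companies left with no events."""
    output = {}
    for company, events in results.items():
        kept = [event for event in events if event not in ("None", "Out")]
        if kept:
            output[company] = kept
    return output


# Hypothetical model output: company name -> predicted event types.
predicted = {
    "甲公司": ["None"],
    "乙股份": ["ExecutiveChange", "Out"],
    "丙集团": ["Out"],
}
final = screen_results(predicted)
```

This is where the noise company names leave the pipeline: they are extracted and classified like any other company name, and only discarded at the very end.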
In another aspect, the invention provides an event extraction system for company information text, comprising a text database module, a text preprocessing module, a text labeling module, a text modeling module, and an output module;
the text database module is used for acquiring and storing company-oriented information texts;
the text preprocessing module is used for cleaning and preprocessing the information texts in the corpus;
the text labeling module is used for labeling the cleaned information texts according to preset rules;
the text modeling module is used for text vectorization and label digitization, and for constructing and training a joint extraction model of company names and event types;
the output module is used for outputting the company names and event types extracted by the model.
A terminal device comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the event extraction method for company information text when executing the computer program.
The beneficial effects of the invention are as follows:
1. In the event extraction method for company information text provided by the invention, a new labeling rule is proposed to address the interference of noise company names with the model: noise company names are brought into the entity recognition labeling rule and given corresponding labels. Company name extraction under noise, which requires judging the company name type and the company name boundary simultaneously, is thereby converted into a simple classification problem, which greatly relieves the burden on the model, reduces the difficulty of the task, and improves the accuracy of recognition and extraction;
2. In the event extraction method for company information text provided by the invention, a three-stage labeling method is provided: automatic labeling by a deep learning model, automatic labeling with an external lexicon, and manual labeling and error correction are performed in sequence. Machine learning methods are thus fully used in the labeling task, which reduces the workload and pressure of the annotators and improves labeling accuracy;
3. In the event extraction method for company information text provided by the invention, a two-stage event extraction model for company names and event types is adopted to suit the proposed labeling rule, which improves the accuracy of event extraction from company information text.
Drawings
FIG. 1 shows an event extraction method for company information text;
FIG. 2 is a flow chart of labeling information text in the event extraction method for company information text;
FIG. 3 is a model structure diagram and training flow chart of the event extraction method for company information text;
FIG. 4 is a flow chart of an event extraction system for company information text;
FIG. 5 is a schematic diagram of an electronic device according to the invention.
Detailed Description
The invention is further described below with reference to examples. The following examples are presented only to aid in the understanding of the invention. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
The invention is further illustrated with reference to the following drawings:
Example 1
Referring to FIG. 1, an event extraction method for company information text comprises the following steps:
Step S1: acquiring company-oriented information texts and constructing a corpus of the information texts;
Step S2: cleaning and preprocessing the information texts in the corpus;
Step S3: labeling the cleaned information texts according to preset rules;
Step S4: performing text vectorization and label digitization on the labeled information texts;
Step S5: constructing and training a two-stage event extraction model for company names and event types, and extracting company names and their corresponding event types with the trained model;
Step S6: screening and outputting the company names and event types extracted by the model.
Further, step S2 mainly comprises performing the following operations in sequence: unifying the case of English letters, unifying Chinese and English punctuation marks, converting Traditional Chinese to Simplified Chinese, and deleting garbled and unprintable characters;
Further, step S3 comprises the following steps:
Step S31: labeling the fields of all company names and their abbreviations in the information text, $[com_1, com_2, com_3, \dots]$; labeling, according to the preset event types $[EventType_1, EventType_2, EventType_3, \dots, EventType_n, None, Out]$, all event types that occur for each company name field in the given information text, where $[EventType_1, EventType_2, EventType_3, \dots, EventType_n]$ denotes the event types to be extracted, $n$ is the total number of such event types, None indicates that nothing happened to the company name field in the given information text, and Out indicates that an event type other than those to be extracted occurred for the company name field in the given information text;
Step S32: taking the example text "Company A's news flash today: a person resigns as president of a certain department and leaves Company B; a certain shareholder intends to reduce holdings by no more than 6% of the shares." as an example, the labeled company name fields are "company A", "department", "company B", "a plurality of" and "a rich stock", and their event types are, respectively, None, senior management change, Out, shareholder reduction of holdings, and Out;
Further, referring to FIG. 2, labeling the fields of all company names and their abbreviations in the information text, $[com_1, com_2, com_3, \dots]$, in step S31 specifically comprises:
Step S311: acquiring an open-source dataset with strongly labeled company names, constructing and training a BERT+Softmax company name entity extraction model, and automatically labeling the information texts with the trained company name entity extraction model;
Step S312: acquiring an open-source company name lexicon, and continuing to label company names in the information texts using a forward matching algorithm and the open-source company name lexicon;
Step S313: finally, performing manual verification and correcting wrongly labeled company name fields;
Further, step S4 mainly comprises:
S41: vectorizing the input information text $T_i$ to obtain $X_i = [x_{i1}, x_{i2}, x_{i3}, \dots]$; encoding the positions of the company names in the text with the BIO coding rule to obtain $Tag_i$; masking the position of each labeled company name in the information text with the number 1 to generate, for the company names $[com_{i1}, com_{i2}, com_{i3}, \dots, com_{ik}]$, the mask vectors $M_i = [m_{i1}, m_{i2}, m_{i3}, \dots, m_{ik}]$ relative to the information text $T_i$; and generating the event category labels $Lab_i = [lab_{i1}, lab_{i2}, lab_{i3}, \dots, lab_{ik}]$ corresponding to each event subject, where $k$ is the total number of company name fields in the information text $T_i$, and each company name field $com_{ij}$ has a corresponding mask vector $m_{ij}$ and event category label $lab_{ij}$;
S42: taking the example text "Company A's news flash today: a person resigns as president of a certain department and leaves Company B; a certain shareholder intends to reduce holdings by no more than 6% of the shares." as an example, text vectorization yields a one-dimensional vector of length 46, [101, 4567, …, 102]; the mask vector corresponding to the company name $com_1$ = "company A" is $m_1$ = [0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]; and the corresponding event type label is $lab_1 = [1, 0, 0, \dots, 0]$, where the length of $lab_1$ is the number of preset event types.
Further, referring to FIG. 3, constructing and training the two-stage event extraction model for company names and event types in step S5 comprises the following steps:
Step S51: inputting the vectorized representation $X_i$ of the information text into a pre-trained model $BERT_1$ to obtain the semantic representation $X_{embed,i,1}$ of the information text;
Step S52: passing the semantic representation of the text in turn through a single-layer linear function Linear and the normalized exponential function Softmax to obtain the predicted probability $P_{tag,i}$ of whether each character in the information text belongs to a company name field;
Step S53: computing the loss value $loss_{com}$ for fitting company name fields with the cross-entropy function CrossEntropy, and obtaining a trained company name prediction model after back propagation and parameter optimization;
Step S54: inputting the vectorized representation $X_i$ of the information text into a pre-trained model $BERT_2$ to obtain the semantic representation $X_{embed,i,2}$ of the information text;
Step S55: traversing the mask vectors $m_{ij}$ of each information text, using each mask vector $m_{ij}$ to select the representation vectors $X_{embed,ij}$ corresponding to company name $j$ in text $i$, and passing them in turn through a pooling function Avgpool, a single-layer linear function Linear, and the logistic function Sigmoid to obtain the probability distribution $p_{type,ij}$ over the events occurring for company name $j$ in text $i$;
Step S56: computing the loss value $loss_{type}$ for predicting event types with the binary cross-entropy loss function BCELoss, and performing back propagation and model parameter optimization to obtain a trained event type prediction model;
Further, referring to FIG. 3, extracting the company names and corresponding event types with the trained model in step S5 comprises the following step:
Step S57: according to steps S51 and S52, obtaining the probability $P_{tag,i}$ of whether each character in the input information text belongs to a company name field, and extracting the company name fields $[com_{i1}, com_{i2}, com_{i3}, \dots, com_{ik}]$ in the input information text; according to step S41, masking the position of each extracted company name in the information text with the number 1 to generate the mask vectors $M_i = [m_{i1}, m_{i2}, m_{i3}, \dots, m_{ik}]$ of the company names relative to the information text $T_i$; according to steps S54 and S55, obtaining the probability distribution $p_{type,ij}$ over the events occurring for each company name field in the information text, and extracting the event types of each company name field in the input information text; if the extracted event types of a company name field are empty, that is, the probability of every category in $p_{type,ij}$ is less than 0.5, the event type with the largest probability is selected as the predicted event type;
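The hand-off between the two stages in step S57 can be sketched as decoding the predicted BIO tags into spans and turning each span into a mask vector. The tag names `B-COM`/`I-COM`, the function names, and the sample tokens are assumptions; the tags are taken as already chosen from $P_{tag,i}$ (for example by taking the most probable tag per character).

```python
def decode_bio(tokens, tags):
    """Collect (start, end, name) spans from B-COM/I-COM tags; end is exclusive."""
    spans, start = [], None
    for pos, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        if start is not None and tag != "I-COM":
            spans.append((start, pos, "".join(tokens[start:pos])))
            start = None
        if tag == "B-COM":
            start = pos
    return spans

def spans_to_masks(spans, length):
    """One mask vector per extracted company name, 1 inside the span."""
    return [[1 if s <= pos < e else 0 for pos in range(length)] for s, e, _ in spans]


tokens = ["今", "甲", "司", "与", "乙"]
tags = ["O", "B-COM", "I-COM", "O", "B-COM"]
spans = decode_bio(tokens, tags)
masks = spans_to_masks(spans, len(tokens))
```

The masks built here play the same role at inference time as the masks built from the gold labels in step S41 play during training.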
Further, screening and outputting the company names and event types extracted by the model in step S6 comprises: judging whether the event types corresponding to a company name contain Out or None and, if so, deleting those event types; if event types remain for the company name after the deletion, outputting the company name and its corresponding event types; otherwise, deleting the company name together with its corresponding event types.
Example 2
Referring to FIG. 4, an event extraction system for company information text comprises: a text database module, a text preprocessing module, a text labeling module, a text modeling module, and an output module;
the text database module is used for acquiring and storing company-oriented information texts;
the text preprocessing module is used for cleaning and preprocessing the information texts in the corpus;
the text labeling module is used for labeling the cleaned information texts according to preset rules;
the text modeling module is used for text vectorization and label digitization, and for constructing and training a joint extraction model of company names and event types;
the output module is used for outputting the company names and event types extracted by the model.
The specific manner in which each module of the system in the above embodiment performs its operations has been described in detail in the method embodiment and is not repeated here.
For system embodiments, reference is made to the description of method embodiments for the relevant points, since they essentially correspond to the method embodiments. The system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
Correspondingly, the application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the event extraction method for company information text described above. FIG. 5 is a hardware structure diagram of a device with data processing capability on which the system of an embodiment of the invention resides; in addition to the processor, memory, and network interface shown in FIG. 5, such a device may further include other hardware according to its actual function, which is not described in detail here.
Accordingly, the present application also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a company information text oriented event extraction method as described above. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the data processing enabled devices described in any of the previous embodiments. The computer readable storage medium may also be an external storage device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), or the like, provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any device having data processing capabilities. The computer readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing apparatus, and may also be used for temporarily storing data that has been output or is to be output.
It will be understood that the above is only one embodiment of the present invention, and that the present invention is not limited to the structure that has been described above and shown in the drawings, but that several modifications and adaptations can be made without departing from the principle of the present invention. The scope of the invention is limited only by the appended claims.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.

Claims (3)

1. An event extraction method for company information text, characterized by comprising the following steps:
(1) obtaining company-oriented information texts and constructing a corpus of information texts; cleaning and preprocessing the information texts in the corpus; wherein the cleaning and preprocessing of the information texts in the corpus specifically comprises performing, in sequence, the operations of unifying the case of English letters, unifying Chinese and English punctuation marks, converting Traditional Chinese to Simplified Chinese, and deleting garbled and unprintable characters;
(2) annotating the cleaned information texts according to preset rules; performing text vectorization and label digitization on the annotated information texts; wherein annotating the cleaned information texts according to the preset rules comprises the following sub-steps:
(2.1) annotating all fields of company names and their abbreviations in the information text, [com_1, com_2, com_3, …];
(2.2) according to the preset event types [EventType_1, EventType_2, EventType_3, …, EventType_n, None, Out], annotating all event types that occur for each company name field in the given information text, where [EventType_1, EventType_2, EventType_3, …, EventType_n] denotes the event types to be extracted, n denotes that there are n event types in total, None denotes that nothing happened to the company name field in the given information text, and Out denotes that an event type other than the event types to be extracted occurred for the company name field in the given information text;
wherein annotating all fields of company names and their abbreviations in the information text in step (2.1) specifically comprises:
(2.1.1) obtaining an open-source dataset with strong company-name annotations, using the CLUENER fine-grained named entity recognition dataset, and separately filtering out the samples in the dataset that contain company-name annotations;
(2.1.2) constructing and training a BERT+Softmax company-name entity extraction model, and using the trained company-name entity extraction model to automatically annotate the information texts;
(2.1.3) obtaining an open-source company-name vocabulary, and using a forward matching algorithm together with the open-source company-name vocabulary to continue annotating company names in the information texts;
(2.1.4) finally performing manual verification, checking and correcting wrongly annotated company name fields, and adding annotations for company name fields that were not annotated;
wherein performing text vectorization and label digitization on the annotated information texts specifically comprises: vectorizing the input information text T_i to obtain X_i = [x_{i1}, x_{i2}, x_{i3}, ...]; encoding the positions of the company names in the text using the BIO encoding rule to obtain Tag_i; masking, with the number 1, the position in the information text of each company name in the tags, to generate for each company name [com_{i1}, com_{i2}, com_{i3}, ..., com_{ik}] a mask vector relative to the information text T_i, M_i = [m_{i1}, m_{i2}, m_{i3}, ..., m_{ik}]; and generating the event category labels corresponding to each event subject, Lab_i = [lab_{i1}, lab_{i2}, lab_{i3}, ..., lab_{ik}], where k denotes that there are k company name fields in the information text T_i, and each company name field com_{ij} has a corresponding mask vector m_{ij} and event category label lab_{ij};
(3) constructing and training a two-stage event extraction model for company names and event types, and using the trained model to extract company names and the corresponding event types; wherein constructing and training the two-stage event extraction model for company names and event types specifically comprises the following sub-steps:
(3.1) inputting the vectorized representation X_i of the information text into the pre-trained model BERT_1 to obtain the semantic representation X_{embed,i,1} of the information text;
(3.2) inputting the semantic representation of the text, in sequence, into a single-layer linear function Linear and the normalized exponential function Softmax to obtain the predicted probability value P_{tag,i} of whether each character in the information text belongs to a company name field; using the cross-entropy function crossentropy to calculate the loss value loss_com incurred in fitting the company name fields; and performing back propagation and parameter optimization to obtain the trained company-name prediction model;
(3.3) inputting the vectorized representation X_i of the information text into the pre-trained model BERT_2 to obtain the semantic representation X_{embed,i,2} of the information text;
(3.4) traversing the mask vectors m_{ij} of each information text; using the mask vector m_{ij} to select the representation vector X_{embed,ij} corresponding to company name j in text i; inputting it, in sequence, into the pooling function Avgpool, a single-layer linear function Linear and the logistic function Sigmoid to obtain the probability distribution p_{type,ij} over the different events occurring for company name j in text i; using the binary cross-entropy loss function BCELoss to calculate the loss value loss_type incurred in predicting the event types; and performing back propagation and model parameter optimization to obtain the trained event-type prediction model;
wherein using the trained model to extract company names and the corresponding event types specifically comprises: obtaining, from the constructed and trained two-stage event extraction model for company names and event types, the probability value P_{tag,i} of whether each character in the input information text belongs to a company name field, and extracting accordingly the company name fields [com_{i1}, com_{i2}, com_{i3}, ..., com_{ik}] in the input information text; masking, with the number 1, the position in the information text of each company name so obtained, to generate the mask vector of each company name relative to the information text T_i, M_i = [m_{i1}, m_{i2}, m_{i3}, ..., m_{ik}]; obtaining, from the constructed and trained two-stage event extraction model, the probability distribution p_{type,ij} over the different events occurring for each company name field in the information text, and extracting accordingly the event types occurring for each company name field in the input information text; and, if the event types occurring for a company name field are empty, that is, the probability of every category in the probability distribution p_{type,ij} is less than 0.5, selecting the event type with the largest probability value as the predicted event type;
(4) finally filtering and outputting the extracted company names and the corresponding event types; wherein filtering and outputting the company names and event types extracted by the model specifically comprises: determining whether the event types corresponding to a company name include Out and None, and if so, deleting those event types; if the event types corresponding to the company name are not empty after the deletion, outputting the company name and its corresponding event types; and if they are empty, deleting the company name and its corresponding event types.

2. An event extraction system for company information text, characterized by comprising: a text database module, a text preprocessing module, a text annotation module, a text modeling module and an output module;
the text database module: obtaining company-oriented information texts and constructing a corpus of information texts; cleaning and preprocessing the information texts in the corpus; wherein the cleaning and preprocessing of the information texts in the corpus specifically comprises performing, in sequence, the operations of unifying the case of English letters, unifying Chinese and English punctuation marks, converting Traditional Chinese to Simplified Chinese, and deleting garbled and unprintable characters;
the text annotation module: annotating the cleaned information texts according to preset rules; performing text vectorization and label digitization on the annotated information texts; wherein annotating the cleaned information texts according to the preset rules comprises the following sub-steps:
(2.1) annotating all fields of company names and their abbreviations in the information text, [com_1, com_2, com_3, …];
(2.2) according to the preset event types [EventType_1, EventType_2, EventType_3, …, EventType_n, None, Out], annotating all event types that occur for each company name field in the given information text, where [EventType_1, EventType_2, EventType_3, …, EventType_n] denotes the event types to be extracted, n denotes that there are n event types in total, None denotes that nothing happened to the company name field in the given information text, and Out denotes that an event type other than the event types to be extracted occurred for the company name field in the given information text;
wherein annotating all fields of company names and their abbreviations in the information text in step (2.1) specifically comprises:
(2.1.1) obtaining an open-source dataset with strong company-name annotations, using the CLUENER fine-grained named entity recognition dataset, and separately filtering out the samples in the dataset that contain company-name annotations;
(2.1.2) constructing and training a BERT+Softmax company-name entity extraction model, and using the trained company-name entity extraction model to automatically annotate the information texts;
(2.1.3) obtaining an open-source company-name vocabulary, and using a forward matching algorithm together with the open-source company-name vocabulary to continue annotating company names in the information texts;
(2.1.4) finally performing manual verification, checking and correcting wrongly annotated company name fields, and adding annotations for company name fields that were not annotated;
wherein performing text vectorization and label digitization on the annotated information texts specifically comprises: vectorizing the input information text T_i to obtain X_i = [x_{i1}, x_{i2}, x_{i3}, ...]; encoding the positions of the company names in the text using the BIO encoding rule to obtain Tag_i; masking, with the number 1, the position in the information text of each company name in the tags, to generate for each company name [com_{i1}, com_{i2}, com_{i3}, ..., com_{ik}] a mask vector relative to the information text T_i, M_i = [m_{i1}, m_{i2}, m_{i3}, ..., m_{ik}]; and generating the event category labels corresponding to each event subject, Lab_i = [lab_{i1}, lab_{i2}, lab_{i3}, ..., lab_{ik}], where k denotes that there are k company name fields in the information text T_i, and each company name field com_{ij} has a corresponding mask vector m_{ij} and event category label lab_{ij};
the text modeling module: constructing and training a two-stage event extraction model for company names and event types, and using the trained model to extract company names and the corresponding event types; wherein constructing and training the two-stage event extraction model for company names and event types specifically comprises the following sub-steps:
(3.1) inputting the vectorized representation X_i of the information text into the pre-trained model BERT_1 to obtain the semantic representation X_{embed,i,1} of the information text;
(3.2) inputting the semantic representation of the text, in sequence, into a single-layer linear function Linear and the normalized exponential function Softmax to obtain the predicted probability value P_{tag,i} of whether each character in the information text belongs to a company name field; using the cross-entropy function crossentropy to calculate the loss value loss_com incurred in fitting the company name fields; and performing back propagation and parameter optimization to obtain the trained company-name prediction model;
(3.3) inputting the vectorized representation X_i of the information text into the pre-trained model BERT_2 to obtain the semantic representation X_{embed,i,2} of the information text;
(3.4) traversing the mask vectors m_{ij} of each information text; using the mask vector m_{ij} to select the representation vector X_{embed,ij} corresponding to company name j in text i; inputting it, in sequence, into the pooling function Avgpool, a single-layer linear function Linear and the logistic function Sigmoid to obtain the probability distribution p_{type,ij} over the different events occurring for company name j in text i; using the binary cross-entropy loss function BCELoss to calculate the loss value loss_type incurred in predicting the event types; and performing back propagation and model parameter optimization to obtain the trained event-type prediction model;
wherein using the trained model to extract company names and the corresponding event types specifically comprises: obtaining, from the constructed and trained two-stage event extraction model for company names and event types, the probability value P_{tag,i} of whether each character in the input information text belongs to a company name field, and extracting accordingly the company name fields [com_{i1}, com_{i2}, com_{i3}, ..., com_{ik}] in the input information text; masking, with the number 1, the position in the information text of each company name so obtained, to generate the mask vector of each company name relative to the information text T_i, M_i = [m_{i1}, m_{i2}, m_{i3}, ..., m_{ik}]; obtaining, from the constructed and trained two-stage event extraction model, the probability distribution p_{type,ij} over the different events occurring for each company name field in the information text, and extracting accordingly the event types occurring for each company name field in the input information text; and, if the event types occurring for a company name field are empty, that is, the probability of every category in the probability distribution p_{type,ij} is less than 0.5, selecting the event type with the largest probability value as the predicted event type;
the output module: finally filtering and outputting the extracted company names and the corresponding event types; wherein filtering and outputting the company names and event types extracted by the model specifically comprises: determining whether the event types corresponding to a company name include Out and None, and if so, deleting those event types; if the event types corresponding to the company name are not empty after the deletion, outputting the company name and its corresponding event types; and if they are empty, deleting the company name and its corresponding event types.

3. A terminal device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the event extraction method for company information text according to claim 1.
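For illustration only, the label-side bookkeeping recited in claim 1 (BIO tags to mask vectors, the 0.5 threshold with fallback to the most probable event type, and the final removal of Out and None) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the simplified untyped "B"/"I"/"O" tag strings, and the event labels "Acquisition" and "Financing" are all hypothetical, and the probabilities are supplied by hand rather than produced by the BERT-based models.

```python
from typing import Dict, List, Tuple

EXCLUDED = {"Out", "None"}  # labels deleted in the final filtering step (4)


def spans_from_bio(tags: List[str]) -> List[Tuple[int, int]]:
    """Return [start, end) spans for each company name in a B/I/O tag sequence."""
    spans, start = [], None
    for idx, tag in enumerate(tags):
        if tag == "B":                 # a new company name begins here
            if start is not None:
                spans.append((start, idx))
            start = idx
        elif tag == "O":               # outside any name: close the open span
            if start is not None:
                spans.append((start, idx))
            start = None
        # "I" continues the open span; a stray "I" with no open span is ignored
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def mask_vectors(tags: List[str]) -> List[List[int]]:
    """One mask per company name: 1 at the name's positions, 0 elsewhere."""
    return [[1 if s <= i < e else 0 for i in range(len(tags))]
            for s, e in spans_from_bio(tags)]


def select_event_types(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep labels at or above the threshold; if none qualify, take the most probable."""
    chosen = [label for label, p in probs.items() if p >= threshold]
    if not chosen:
        chosen = [max(probs, key=probs.get)]
    return chosen


def filter_output(predictions: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop Out/None labels, then drop companies left with no event type."""
    result = {}
    for company, types in predictions.items():
        kept = [t for t in types if t not in EXCLUDED]
        if kept:
            result[company] = kept
    return result


# Hypothetical eight-character text with two company names (positions 0-2 and 5-7).
tags = ["B", "I", "I", "O", "O", "B", "I", "I"]
masks = mask_vectors(tags)

# Hand-written probabilities standing in for the sigmoid outputs of stage two.
predictions = {
    "company_a": select_event_types(
        {"Acquisition": 0.9, "Financing": 0.2, "None": 0.1, "Out": 0.05}),
    "company_b": select_event_types(
        {"Acquisition": 0.3, "Financing": 0.1, "None": 0.4, "Out": 0.2}),
}
final = filter_output(predictions)
```

In this sketch the second company has no probability reaching 0.5, so the fallback picks its most probable label, which happens to be None, and the filtering step then removes that company from the output.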
CN202311259460.1A 2023-09-27 2023-09-27 Event extraction method and system for company information text Active CN116991983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311259460.1A CN116991983B (en) 2023-09-27 2023-09-27 Event extraction method and system for company information text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311259460.1A CN116991983B (en) 2023-09-27 2023-09-27 Event extraction method and system for company information text

Publications (2)

Publication Number Publication Date
CN116991983A (en) 2023-11-03
CN116991983B (en) 2024-02-02

Family

ID=88534216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311259460.1A Active CN116991983B (en) 2023-09-27 2023-09-27 Event extraction method and system for company information text

Country Status (1)

Country Link
CN (1) CN116991983B (en)

Also Published As

Publication number Publication date
CN116991983A (en) 2023-11-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant