
CN118780269B - High-efficiency cross-document information extraction system and method based on large language model - Google Patents


Info

Publication number
CN118780269B
CN118780269B
Authority
CN
China
Prior art keywords
information
document
module
event
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411272956.7A
Other languages
Chinese (zh)
Other versions
CN118780269A
Inventor
黄登蓉
岳爱珍
张其来
张思嘉
张吉成
段强
姜凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202411272956.7A
Publication of CN118780269A
Application granted
Publication of CN118780269B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention relates to the field of artificial intelligence technology, and in particular to an efficient cross-document information extraction system and method based on a large language model. The information compression module of the present invention not only effectively solves the information redundancy and fragmentation problems in cross-document event extraction through multi-format data conversion and efficient data cleaning, as well as feature analysis, metadata management, consistency verification and security measures, but also ensures the comprehensiveness and reliability of information processing. The event extraction module realizes accurate recognition and classification of events in complex texts through pre-trained large-scale language models and supervised sequence annotation models. The cross-document event collection module uses deep semantic analysis and intelligent similarity evaluation technology to effectively integrate relevant event information from different documents and different time points, providing users with a more complete and coherent event picture.

Description

High-efficiency cross-document information extraction system and method based on large language model
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a high-efficiency cross-document information extraction system and method based on a large language model.
Background
In the age of information explosion, event extraction is one of the core tasks in the field of information extraction, and is essential for understanding complex and changeable text data and for mining potential events and their association relationships. Especially when processing cross-document events, texts present challenges such as large time spans, scattered information distribution, and complex, changeable contexts.
Conventional approaches often have difficulty effectively capturing and integrating the full view of these events and their inherent links. Although current research has attempted to address the difficult problem of cross-document event extraction by fine-tuning small language models, in the face of increasingly complex and diverse text environments these approaches still suffer from low accuracy in event extraction, poor understanding of event relationships in complex contexts, and the like.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a high-efficiency cross-document information extraction method and system based on a large language model, which are beneficial to improving the accuracy of event extraction and the understanding capability of event relations under complex contexts.
In order to achieve the technical scheme, the first aspect of the invention provides a high-efficiency cross-document information extraction system based on a large language model, which comprises an information compression module, an event extraction module and a cross-document event collection module. The information compression module is used for carrying out deep analysis and summarization on a plurality of documents in various formats through the large language model so as to acquire cross-document summary information;
the event extraction module is used for extracting event information from the generated cross-document summary information through the large-scale language model and the supervised sequence labeling model;
the cross-document event collection module is used for obtaining an event view through deep semantic analysis and vectorization representation based on the extracted event information;
The information compression module comprises a data preprocessing module, which is used for carrying out data conversion, data cleaning, feature analysis, metadata management, consistency verification and security measures on a plurality of documents in various formats.
Further, the data preprocessing module includes:
A metadata management unit for collecting and managing metadata of each of a plurality of documents in a plurality of formats;
A document format conversion unit for converting each of a plurality of documents in a plurality of formats into text data using a natural language processing tool;
the feature analysis unit is used for carrying out mean value analysis and maximum and minimum length statistics on the converted text data by applying a machine learning algorithm so as to understand the text data;
The data cleaning unit is used for thoroughly cleaning the text data based on this understanding of the text data;
The data consistency verification unit is used for verifying consistency of the text data in the process of converting a plurality of documents in various formats into the text data and in the process of cleaning the stored text data;
and the data security measure implementation unit is used for implementing the data security measure.
Further, the information compression module further comprises a text segmentation module, a summary extraction module and an information fusion module;
the text segmentation module is used for cutting text data of each document to obtain a plurality of text blocks of each document;
The abstract extraction module is used for extracting abstract information of each document from a plurality of text blocks of each document;
the information fusion module is used for carrying out information fusion on the extracted summary information of each document so as to generate cross-document summary information.
Further, the text segmentation module includes:
the high-efficiency document loading and content extracting unit is used for identifying the text data structure of each document so as to extract key content;
A recursive character text segmentation unit for segmenting the text data into a plurality of structured text blocks using a recursive algorithm based on the extracted key contents;
The dynamic text block size adjusting unit is used for automatically adjusting the size of the text block according to the length and the complexity of the text data;
A sliding window technology application unit for applying a sliding window on the text block to obtain context information of the text block;
and the self-adaptive text segmentation unit is used for dynamically adjusting the segmentation strategy according to the semantic density and structural characteristics derived from the obtained context information of the text blocks.
Further, the information compression module also comprises a multi-language support module and a summary quality evaluation module;
The multi-language support module is used for evaluating abstract information obtained by text data of each document under different language environments and optimizing the abstract information based on an evaluation result;
The summary quality evaluation module is used for performing quality evaluation on the generated cross-document summary information and generating feedback information.
Further, the event extraction module comprises a model training module, an event extraction module and an event evaluation module;
the model training module is used for training and fusing the large-scale language model and the supervised sequence labeling model;
The event extraction module is used for extracting event information from the cross-document summary information through the fused large-scale language model and the supervised sequence labeling model;
the event evaluation module is used for evaluating the performance of the fused large-scale language model and supervised sequence labeling model based on the extracted event information.
Further, the cross-document event collection module comprises a deep semantic analysis and vectorization characterization module, an intelligent similarity evaluation and high-efficiency grouping module, a cross-document event fusion and consistency maintenance module, and a multidimensional event verification and information enhancement module;
The deep semantic analysis and vectorization characterization module is used for learning context information in the extracted event information through a deep neural network by utilizing a text embedding technology so as to convert the event information into high-dimensional and dense semantic vectors;
The intelligent similarity evaluation and high-efficiency grouping module is used for intelligently identifying and merging event information based on the converted semantic vector through a cosine similarity calculation method and a clustering algorithm DBSCAN so as to generate merging event information;
The cross-document event fusion and consistency maintenance module is used for fusing and maintaining the merging events to obtain target event information;
And the multidimensional event verification and information enhancement module is used for verifying and enhancing the target event information so as to obtain an event view.
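The similarity evaluation and grouping step can be sketched as follows. This is a minimal pure-Python illustration, not the patented implementation: it replaces the DBSCAN clustering named above with greedy single-link grouping over cosine similarity, and the vectors and threshold are toy assumptions.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two dense "semantic vectors".
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def group_events(vectors, threshold=0.8):
    # Greedy single-link grouping: an event joins the first group in which
    # it is sufficiently similar to any member -- a simplified stand-in
    # for the DBSCAN clustering described in the module.
    groups = []
    for i, v in enumerate(vectors):
        placed = False
        for g in groups:
            if any(cosine_sim(v, vectors[j]) >= threshold for j in g):
                g.append(i)
                placed = True
                break
        if not placed:
            groups.append([i])
    return groups

# Toy vectors: the first two describe near-duplicate events.
vecs = [[1.0, 0.0, 0.1], [0.9, 0.05, 0.1], [0.0, 1.0, 0.0]]
groups = group_events(vecs)
```

A production system would use a library clustering routine (e.g. DBSCAN with a cosine metric) over embeddings from a text-embedding model, but the merge criterion is the same idea.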
In a second aspect, the present invention provides a method for efficiently extracting cross-document information based on a large language model, comprising the steps of:
Step one, carrying out deep analysis and summarization on a plurality of documents in various formats through a large language model so as to obtain cross-document summary information;
Step two, extracting event information from the generated cross-document summary information through a large-scale language model and a supervised sequence labeling model;
Step three, based on the extracted event information, obtaining an event view through deep semantic analysis and vectorization characterization;
and fourthly, continuously optimizing the system through data feedback and algorithm iteration.
Further, the first step includes:
preprocessing a plurality of documents in a plurality of formats to obtain text data of each document;
cutting the text data of each document to obtain a plurality of text blocks of each document;
Extracting summary information of each document from a plurality of text blocks of each document;
and carrying out information fusion on the extracted summary information of each document to generate cross-document summary information.
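The four method steps above can be expressed as a pipeline skeleton. This is an illustrative sketch only: each stage is a callable placeholder standing in for the corresponding module, and the stand-in lambdas exist purely to show the data flow.

```python
def extract_cross_document_events(documents, compress, extract, collect, optimize):
    # Pipeline mirroring steps one to four of the method.
    summary = compress(documents)  # step one: cross-document summary information
    events = extract(summary)      # step two: event extraction
    view = collect(events)         # step three: cross-document event collection
    return optimize(view)          # step four: feedback/iteration hook

# Trivial stand-ins for the modules, to demonstrate the data flow.
result = extract_cross_document_events(
    ["doc A", "doc B"],
    compress=lambda docs: " | ".join(docs),
    extract=lambda s: s.split(" | "),
    collect=lambda evs: {"events": evs},
    optimize=lambda view: view,
)
```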
In a third aspect, the present invention provides a computer readable storage medium, where the computer readable storage medium includes a stored program, and when the program runs, it controls a device where the computer readable storage medium is located to perform the high-efficiency cross-document information extraction method based on a large language model as described above.
The invention has the beneficial effects that:
(1) The information compression module not only effectively solves the problems of information redundancy and fragmentation in cross-document event extraction, but also ensures the comprehensiveness and reliability of information processing through multi-format data conversion, efficient data cleaning, feature analysis, metadata management, consistency verification and safety measures. The event extraction module realizes accurate identification and classification of events in complex texts through a pre-trained large-scale language model and a supervised sequence labeling model. The cross-document event collection module utilizes deep semantic analysis and intelligent similarity evaluation technology, effectively integrates related event information from different documents and different time points, and provides a more complete and coherent event view for users.
(2) In addition, the performance evaluation and optimization module continuously monitors the running state of the system, and continuous optimization and stability of the system performance are ensured through data feedback and algorithm iteration.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic block diagram of a high-efficiency cross-document information extraction system based on a large language model of the present invention.
FIG. 2 is a flow chart of a method for efficient cross-document information extraction based on a large language model of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, each technical and scientific term used in this example has the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
In the present invention, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", etc. refer to an orientation or a positional relationship based on that shown in the drawings, and are merely relational terms, which are used for convenience in describing structural relationships of various components or elements of the present invention, and do not denote any one of the components or elements of the present invention, and are not to be construed as limiting the present invention.
In the present invention, terms such as "fixedly attached," "connected," "coupled," and the like are to be construed broadly and refer to either a fixed connection or an integral or removable connection, or both, as well as directly or indirectly via an intermediary. The specific meaning of the terms in the present invention can be determined according to circumstances by a person skilled in the relevant art or the art, and is not to be construed as limiting the present invention.
Example 1:
As shown in fig. 1, the present embodiment provides a high-efficiency cross-document information extraction system based on a large language model, which includes the following modules:
(I) An information compression module, for carrying out deep analysis and summarization on a plurality of documents in various formats through a large language model (such as Qwen2) so as to acquire cross-document summary information.
The information compression module includes:
(1) And the data preprocessing module is used for preprocessing a plurality of documents in a plurality of formats to obtain text data of each document.
A) And a metadata management unit for collecting and managing metadata of each document in the plurality of documents in a plurality of formats, wherein the metadata of each document comprises information of a source, an author, a creation date and the like of the document. These metadata are critical to understanding the context of the data and for subsequent analysis of the data.
B) And a document format conversion unit for converting each of a plurality of documents in a plurality of formats (such as PDF, HTML, word and the like) into text data which is easy to process by using a natural language processing tool such as a Word text recognizer and a PDF extractor. This conversion process not only helps to increase the efficiency of data processing, but also enhances the consistency of the data by eliminating format differences.
C) And the feature analysis unit is used for carrying out mean analysis and maximum and minimum length statistics on the converted text data by applying a machine learning algorithm so as to understand the text data (such as the features and distribution of the text data). This helps identify patterns and anomalies in the data, providing a basis for further data mining and knowledge discovery.
D) And the data cleaning unit is used for thoroughly cleaning the data of the text data based on the understood text data, wherein the data cleaning comprises the steps of removing irrelevant information, noise filtering and data standardization. This helps to ensure the quality and accuracy of the text data, which lays a solid foundation for subsequent analysis and processing.
E) And the data storage unit is used for storing the cleaned data in a file with a specific format, such as JSON, XML or CSV. The formats not only ensure the safety of the data, but also improve the accessibility and the readability of the data, and facilitate the subsequent data utilization and analysis.
F) And the data consistency verification unit is used for verifying the consistency of the text data in the process of converting a plurality of documents in various formats into the text data and in the process of cleaning the stored text data, so that the consistency of the data in different sources and formats is ensured after integration, and analysis errors caused by inconsistent data are avoided.
G) And implementing a data security measure unit, namely implementing strict data security measures including data encryption, access control and audit logs in the process of data storage and processing to protect data from unauthorized access or tampering.
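A minimal sketch of the preprocessing flow described by units A) through E) above: cleaning converted text and bundling it with its metadata as a JSON record. The function names and the cleaning rules are illustrative assumptions, not the patented implementation.

```python
import json
import re

def clean_text(raw: str) -> str:
    # Minimal cleaning pass: strip control characters and collapse
    # whitespace -- a sketch of the "noise filtering and data
    # standardization" step of the data cleaning unit.
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", raw)
    return re.sub(r"\s+", " ", text).strip()

def store_document(doc_id, raw_text, metadata):
    # Bundle cleaned text with its metadata (source, author, creation
    # date, ...) into a JSON record, as the data storage unit describes.
    record = {
        "id": doc_id,
        "metadata": metadata,
        "text": clean_text(raw_text),
    }
    return json.dumps(record, ensure_ascii=False)
```

XML or CSV serialization would work the same way; JSON is shown because it is the first format the storage unit names.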
(2) And the text segmentation module is used for cutting the text data of each document to obtain a plurality of text blocks of each document.
The text segmentation module comprises:
A) And the efficient document loading and content extracting unit is used for identifying the text data structure of each document, extracting key content in a customized mode and ensuring efficient acquisition and processing of information.
B) And the recursive character text segmentation unit is used for segmenting the text data into a plurality of structured text blocks according to natural language separators such as line feed, space, punctuation and the like by using a recursive algorithm based on the extracted key content. The method not only improves the flexibility of text data processing, but also lays a foundation for further semantic analysis and information extraction.
C) And the dynamic text block size adjusting unit is used for automatically adjusting the size of the text block according to the length and the complexity of the text data by taking the context limitation of the large language model into consideration through a mechanism for dynamically adjusting the size of the text block, so as to realize the optimal cutting of the text data. The method meets the precision requirement, adapts to text data with different lengths, and improves the adaptability and flexibility of processing.
D) And the sliding window technology application unit is used for applying a sliding window on the text block to acquire wider context information while maintaining local information, so that the retrieval precision and the richness of the context information are balanced, the accuracy of extracting the abstract from the text block is improved, and the context information is maintained in the large-block synthesis.
E) And the self-adaptive text segmentation unit, for dynamically adjusting the segmentation strategy according to the semantic density and structural characteristics derived from the obtained context information of the text blocks. This allows the system to better handle text data of different types and styles, improving the accuracy and efficiency of text segmentation.
F) And the multi-dimensional text analysis unit is used for carrying out multi-dimensional text analysis on the text blocks, including but not limited to keyword extraction, topic identification, emotion analysis and the like. These analyses provide rich information and perspectives for the in-depth understanding and application of the text.
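The recursive segmentation and sliding-window units above can be sketched as follows. This is a simplified illustration under stated assumptions: the separators, `max_len`, and the one-chunk overlap are arbitrary demo values, and the real unit would size blocks dynamically against the language model's context limit.

```python
def recursive_split(text, max_len=80, seps=("\n", ". ", " ")):
    # Split on the coarsest natural separator first (line feed, sentence
    # boundary, space), recursing with finer separators while a piece is
    # still longer than max_len; separators themselves are dropped.
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separator left: hard cut into fixed-size pieces.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, rest = seps[0], seps[1:]
    return [c for piece in text.split(head) if piece
            for c in recursive_split(piece, max_len, rest)]

def sliding_windows(chunks, overlap=1):
    # Attach `overlap` neighbouring chunks on each side as context,
    # in the spirit of the sliding window technology application unit.
    return [" ".join(chunks[max(0, i - overlap):i + overlap + 1])
            for i in range(len(chunks))]

chunks = recursive_split("a" * 50 + "\n" + "b" * 50, max_len=80)
windows = sliding_windows(chunks)
```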
(3) And the abstract extraction module is used for extracting abstract information of each document based on a plurality of text blocks of each document.
A) An effective prompt unit is constructed and used for constructing a prompt similar to 'help me extract abstract information in text' so as to guide a large language model to concentrate on key information in the text. These cues are tailored to the text content and context to ensure that the model captures the most relevant summary information.
B) And the context understanding unit is used for understanding the context information obtained based on the sliding window technology application unit so as to ensure the continuity and the relativity of the abstract information and avoid the fault or omission of the information.
C) And the key information identification unit, which analyzes each text block through complex semantic analysis and pattern recognition technology to identify key information such as topic sentences, important data and main viewpoints, and constructs an abstract based on the key information.
D) And a preliminary digest generating unit for generating a preliminary digest of each document based on the key information identified from each text block. Wherein, in the course of generating the preliminary abstract of each document, the primary viewpoint and structure of the original text are maintained as much as possible, while redundant and secondary information is removed.
E) And the continuity optimizing unit, for further optimizing the preliminary abstract so as to improve its continuity and generate an optimized abstract. Specifically, the generated preliminary abstract is optimized by adjusting sentence order, adding transitional words or phrases, and the like, so as to ensure the logical fluency of the abstract.
F) And the summary and integrity balancing unit is used for searching a proper balancing point between the length and the information density of the optimized summary so as to ensure the summary to be concise and ensure the integrity of the summary, namely key information is not missed, and the summary information of each document is extracted based on the balancing point.
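The prompt-construction step in unit A) above can be sketched like this. The prompt wording, `summarize_blocks`, and the fake echo model are all illustrative assumptions; in practice `llm` would be a call to an actual large language model such as the Qwen model mentioned elsewhere in the description.

```python
def build_summary_prompt(text_block: str) -> str:
    # Build a cue in the spirit of "help me extract abstract information
    # in text", tailored to focus the model on key information.
    return (
        "Please extract the key summary information from the following "
        "text, preserving the main viewpoints and omitting redundant "
        "detail.\n\nText:\n" + text_block + "\n\nSummary:"
    )

def summarize_blocks(blocks, llm):
    # llm is any callable prompt -> str; a real deployment would call
    # a large language model here.
    return [llm(build_summary_prompt(b)) for b in blocks]

# A trivial fake "model" for demonstration: echoes the first sentence.
fake_llm = lambda prompt: prompt.split("Text:\n", 1)[1].split(".")[0] + "."

summaries = summarize_blocks(
    ["First point. More detail.", "Second point. Extra."], fake_llm)
```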
(4) And the information fusion module is used for carrying out information fusion on the extracted summary information of each document so as to generate cross-document summary information.
The information fusion module comprises:
A) And the hierarchical attention mechanism unit is used for capturing key information at the word level and carrying out importance assessment at the sentence and paragraph level for each document by adopting a multi-layer attention mechanism so as to more comprehensively understand the document structure and the theme.
B) And the context-aware embedding unit is used for dynamically adjusting the representation of the word or phrase according to the context through the dynamic embedding technology of the pre-training language model, so as to improve the accuracy of semantic understanding.
C) And a long-text processing unit: for long documents, the module adopts a strategy of blocked processing and global integration; the document is first segmented, and global information is then integrated through a span attention mechanism, so that information loss is avoided.
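The block-then-integrate strategy can be sketched as a two-level reduction. This is a deliberately simplified stand-in: the span attention mechanism is replaced by simply re-summarizing the concatenated partial summaries, and the chunk length and demo summarizer are arbitrary.

```python
def summarize_long(text, summarizer, chunk_len=200):
    # Blocked processing + global integration, simplified: summarize
    # fixed-size chunks, then summarize the joined partial summaries.
    chunks = [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]
    partials = [summarizer(c) for c in chunks]
    return summarizer(" ".join(partials))

# Trivial "summarizer" for demonstration: keep the first 10 characters.
result = summarize_long("x" * 500, lambda t: t[:10])
```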
(5) And the multi-language support module is used for translating the text data of each document into different languages so as to acquire abstract information of each document in different language environments and evaluating the abstract information so as to optimize the abstract information based on an evaluation result.
The multi-language support module includes:
A) A multilingual training model, for training a multilingual model (e.g., Qwen, Llama 3) with a large amount of multilingual data so that the model can understand the grammatical structures and semantic information of different languages. This enables the model to handle text translation and to achieve cross-language understanding and generation in tasks such as text summarization and sentiment analysis.
B) The multi-language model is internally provided with a high-efficiency language detection algorithm, the text language of the text data of each document is automatically identified through the trained multi-language model, and based on the identification result, the text data is automatically switched to corresponding language processing and algorithm to process. The automatic adaptation mechanism ensures the processing effect of texts in different languages, and improves the flexibility and user satisfaction of the system.
C) And the cross-language consistency assessment unit optimizes the abstract information by comparing the abstract information obtained by the same text data in different languages in a multi-language environment so as to ensure consistency and accuracy of the abstract information. This involves not only direct translation comparisons between languages, but also consideration of cultural differences and contextual understanding to promote overall reliability of the cross-language abstract. The evaluation mechanism is detailed as follows:
i) A multilingual translation engine integration unit, for integrating industry-leading multilingual translation engines, ensuring that text data can be accurately converted into the target languages, which serve as the basis of evaluation. These engines are continuously optimized to handle complex linguistic phenomena and provide high-quality translation for evaluation.
ii) A cultural sensitivity analysis unit, for identifying specific expressions, metaphors or slang in different language versions of the text data that may arise from cultural background differences, and comparing them with a pre-established cultural difference database.
iii) A context understanding enhancement unit, for assessing the suitability of text data in different contexts through natural language processing (NLP) techniques, in particular a context-aware model such as Qwen. The unit analyzes the context in which the text data is located, including topics, emotional coloring, social occasions, etc., to help assess whether the text data maintains the same contextual meaning and emotional tendency in different language versions.
iv) A manual auditing and feedback loop unit: a manual auditing team, composed of experts with multilingual capability and deep cultural background knowledge, rechecks the automatic evaluation results, paying particular attention to subtle cultural differences and context changes that automatic tools find difficult to capture. The feedback from manual auditing is incorporated into the continuous optimization of the evaluation system, forming a closed-loop feedback mechanism that continuously improves the accuracy and efficiency of evaluation.
D) The language resource library construction module is used for constructing a rich language resource library including vocabulary, grammar rules, language models and the like. These resources provide the necessary language knowledge for the system to better understand and generate text in various languages.
(6) The summary quality evaluation module is used for performing quality evaluation on the generated cross-document summary information and generating feedback information.
A) And the automatic evaluation module is used for automatically evaluating the generated target abstract based on evaluation indexes such as overlapping degree, semantic similarity, consistency, diversity and the like so as to generate automatic evaluation feedback, thereby comprehensively measuring the quality of the target abstract.
B) And the human evaluation feedback module is used for periodically collecting human expert evaluation feedback so as to calibrate the parameters of the automatic evaluation module and ensure the reliability of the evaluation result.
C) And the real-time feedback loop module is used for continuously optimizing the abstract quality through reinforcement learning or fine-tuning techniques based on automatic evaluation feedback and human expert evaluation feedback.
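The overlap-degree index used by the automatic evaluation module can be illustrated with a minimal, ROUGE-1-style unigram-overlap F1. This is a simplified sketch for exposition only; the system described above would combine such an index with semantic-similarity, consistency, and diversity measures:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """ROUGE-1-style unigram-overlap F1 between a generated summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 indicates full unigram overlap with the reference; 0.0 indicates none.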
And (II) an event extraction module, which is used for extracting event information from the generated cross-document summary information through a large-scale language model (such as BERT, GPT series and the like) and a supervised sequence labeling model (such as BiLSTM-CRF).
Specifically, the event extraction module realizes accurate extraction of events in the text by combining a large-scale language model and a supervised sequence labeling model.
The event extraction module includes:
(1) And the model training module is used for training and fusing the large-scale language model and the supervised sequence labeling model.
A model training module comprising:
A) And the data collection unit is used for collecting and sorting historical cross-document summary information containing rich event information. These cross-document summary information should cover a wide variety of contexts and event types to ensure the generalization capability of the model. Meanwhile, preprocessing is carried out on historical cross-document summary information, including word segmentation, part-of-speech tagging, named entity recognition and the like, and high-quality input is provided for subsequent model training.
B) And the large language model pre-training unit is used for pre-training the large-scale language model by utilizing the collected historical cross-document summary information so as to enable the large-scale language model to better understand the event structure and the context information in the cross-document summary information by fine-tuning the parameters of the large-scale language model. This step aims to provide preliminary event candidates or event characterizations for event extraction using the general knowledge of the language model.
C) The supervised sequence labeling model pre-training unit is used for constructing a supervised sequence labeling model and training the constructed supervised sequence labeling model through output data in the pre-training process of the large-scale language model.
Specifically, the supervised sequence labeling model is used to receive data (e.g., event references, event attributes, etc.) output by the large-scale language model and learn how to accurately identify and classify events by labeling event boundaries and types in that output. In the training process, an optimization strategy such as the cross-entropy loss function is adopted, and model parameters are continuously adjusted to improve the accuracy and efficiency of event extraction.
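The boundary-and-type labeling performed by the supervised sequence labeling model is commonly expressed as BIO tags over tokens. Decoding such tags into typed event spans can be sketched as follows (the tag names are hypothetical, and strict BIO repair rules are omitted for brevity):

```python
def decode_bio(tokens, tags):
    """Decode BIO tags (e.g. B-Movement, I-Movement, O) into (event_type, span_text) pairs."""
    spans, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_type is not None:          # close the previous span
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)              # continue the current span
        else:                                 # O tag or inconsistent I- tag
            if cur_type is not None:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type is not None:
        spans.append((cur_type, " ".join(cur_toks)))
    return spans
```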
D) And the model fusion and optimization unit is used for carrying out deep fusion on the pre-trained large-scale language model and the supervised sequence labeling model, and specifically, the two models can work cooperatively through a fusion mechanism (such as feature splicing and attention mechanism) with reasonable design, so that the event extraction performance is improved jointly.
In addition, the verification set is utilized to tune the model, so that the model is ensured to have good generalization capability in a complex text environment.
(2) The event extraction module is used for extracting event information from the cross-document summary information through the deep-fused large-scale language model and the supervised sequence labeling model.
The event extraction module includes:
a) And the text input unit is used for inputting the cross-document summary information into the deep-fused large-scale language model and the supervised sequence labeling model.
B) And the event candidate generation unit is used for carrying out preliminary processing on the cross-document abstract information through the pre-trained large-scale language model to generate potential event candidates. These candidates may include key information such as event references, event attributes, etc.
C) And the event identification and classification unit is used for further identifying and classifying event candidates through a pre-trained supervised sequence labeling model so as to extract complete event information. By annotating event boundaries and types in cross-document summary information, the supervised sequence labeling model can accurately extract complete event information from the cross-document summary information.
D) The event relation construction unit also needs to consider the relation between the event information for the cross-document summary information. The method can be realized by analyzing the characteristics of co-occurrence, time sequence relation and the like of event candidates in different documents, so that a more complete and accurate event information network is constructed.
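The co-occurrence analysis in the event relation construction unit can be sketched as follows (event and document identifiers are hypothetical; time-sequence relations would be layered on top of this co-occurrence skeleton):

```python
from collections import defaultdict
from itertools import combinations

def build_event_network(events):
    """events: iterable of (event_id, doc_id) pairs.
    Returns edge weights of an event co-occurrence network: two events gain
    one unit of edge weight each time they are extracted from the same document."""
    by_doc = defaultdict(set)
    for ev, doc in events:
        by_doc[doc].add(ev)
    edges = defaultdict(int)
    for evs in by_doc.values():
        for a, b in combinations(sorted(evs), 2):
            edges[(a, b)] += 1
    return dict(edges)
```

Higher edge weights suggest more strongly related events across the document collection.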
(3) And the event evaluation module is used for evaluating the performance of the deep-fused large-scale language model and the supervised sequence labeling model based on the extracted event information.
A) The accuracy evaluation unit is used for calculating the matching degree between the extracted event and the manually marked event and evaluating the accuracy of the model. Commonly used evaluation indexes include accuracy, recall, and F1 values.
Manually marked events are obtained by partitioning a test set from the data set and manually annotating its events as gold labels; the events produced by the event extraction module are then matched against these manual annotations.
B) The efficiency evaluation unit is used for evaluating the efficiency and the speed of the deep-fused large-scale language model and the supervised sequence labeling model when the cross-document abstract information is processed, so that the real-time requirement can be met in practical application.
C) And the generalization capability evaluation unit is used for performing generalization capability test on the large-scale language model and the supervised sequence annotation model by utilizing the data set which does not participate in the training of the large-scale language model and the supervised sequence annotation model so as to evaluate the performances of the large-scale language model and the supervised sequence annotation model under different contexts and event types.
D) And the error analysis unit is used for carrying out deep analysis on error events extracted through the large-scale language model and the supervised sequence labeling model, finding out the reasons and rules of the errors and providing guidance for further optimization of the large-scale language model and the supervised sequence labeling model.
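The precision/recall/F1 computation of the accuracy evaluation unit can be sketched as an exact-match comparison between extracted and manually annotated events (a simplification; practical evaluations often also score partial or argument-level matches):

```python
def event_prf(predicted, gold):
    """Exact-match precision, recall, and F1 between extracted events and gold annotations."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                      # true positives: events found in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1
```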
And thirdly, a cross-document event collection module which is used for intelligently identifying and merging the extracted event information by utilizing a deep semantic analysis and vectorization characterization technology.
The cross-document event collection module comprises:
(1) And the deep semantic analysis and vectorization characterization module is used for accurately converting complex natural language text into high-dimensional, dense semantic vectors by learning the context information in the extracted event information with a deep neural network, using a text embedding technique such as BGE.
The process not only captures subtle differences among vocabularies, but also digs deep logic and semantic association among sentences, paragraphs and even the whole document deeply, and lays a solid foundation for subsequent entity grouping and integration.
(2) And the intelligent similarity evaluation and high-efficiency grouping module is used for carrying out intelligent recognition and high-efficiency merging on the event information based on the converted semantic vector through a cosine similarity calculation method and an advanced clustering algorithm DBSCAN so as to generate merging event information.
The intelligent recognition and efficient merging can automatically detect and merge highly similar entity descriptions in the event, effectively avoid information redundancy, and simultaneously keep diversity and integrity of the event information. By dynamically adjusting the clustering parameters and the similarity threshold, the accuracy and the flexibility of the clustering result are ensured, and a clearer and orderly entity grouping view is provided for the user.
The intelligent similarity evaluation and efficient grouping module comprises:
A) The semantic vector construction deepening unit is used for extracting semantic features in the event by adopting the BGE-Embedding model and generating a high-dimensional semantic vector.
Specifically, the BGE-Embedding model can capture the fine semantic difference between events through training on a large-scale corpus, and provides a solid foundation for subsequent similarity calculation.
B) And the cosine similarity calculation module. While traditional cosine similarity calculation can effectively measure the directional similarity between vectors, it faces problems such as high computational complexity and sensitivity to noise when processing high-dimensional semantic vectors. The module therefore adopts a weighted cosine similarity calculation method that gives different weights to features of different dimensions according to their importance. This optimization not only improves the accuracy of similarity evaluation, but also enhances the robustness of the model to noise.
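A minimal sketch of the weighted cosine similarity described above; the weight vector is an assumption supplied by the caller (e.g., learned feature importances), not something the patent fixes:

```python
import math

def weighted_cosine(u, v, w):
    """Weighted cosine similarity: dimension i is scaled by weight w[i]
    before the standard cosine formula is applied."""
    dot = sum(wi * ui * vi for ui, vi, wi in zip(u, v, w))
    nu = math.sqrt(sum(wi * ui * ui for ui, wi in zip(u, w)))
    nv = math.sqrt(sum(wi * vi * vi for vi, wi in zip(v, w)))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)
```

With all weights equal to 1 this reduces to ordinary cosine similarity.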
C) The DBSCAN customization unit is used for customizing the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm that partitions data points into core points, border points, and noise points according to their local density.
Specifically, the clustering algorithm DBSCAN can automatically identify and exclude noise points, and meanwhile, find out a clustering structure with any shape, and is very suitable for processing a data set with complex distribution characteristics. For a specific application scene, the clustering algorithm DBSCAN is subjected to customized adjustment, wherein the two key parameters of the neighborhood radius (epsilon) and the minimum point number (MinPts) are dynamically adjusted. By monitoring the clustering effect in real time and feeding back and adjusting the parameters, the accuracy and the flexibility of the clustering result are ensured.
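The behavior of DBSCAN with its two key parameters, the neighborhood radius (epsilon) and the minimum point count (MinPts), can be illustrated with a minimal pure-Python implementation (a sketch for exposition, not a production variant; real deployments would use an indexed neighbor search):

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: returns one label per point (cluster id >= 0, or -1 for noise)."""
    n = len(points)
    labels = [None] * n
    cluster = -1

    def neighbors(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:          # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1                     # start a new cluster from this core point
        labels[i] = cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:          # border point previously marked as noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is itself a core point: expand further
                seeds.extend(j_nbrs)
    return labels
```

Tuning eps and min_pts against a quality index such as the silhouette coefficient corresponds to the dynamic parameter adjustment described above.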
D) And the dynamic adjustment unit of the clustering parameters is used for dynamically adjusting the clustering parameters.
Specifically, in order to ensure that the clustering result can accurately reflect the internal structure of the data, a dynamic adjustment strategy for the clustering parameters automatically judges, based on evaluation indexes of the clustering result (silhouette coefficient, Calinski-Harabasz index, etc.), whether the current parameter setting is appropriate, and performs fine adjustment accordingly.
In addition, through a user feedback mechanism, a user can finely adjust the clustering result according to the own requirements, and the flexibility and the user experience of the system are further improved.
(3) And the cross-document event fusion and consistency maintenance module is used for fusing and maintaining the merging event information so as to obtain target event information.
In particular, homonymous or highly similar entities from different documents are intelligently identified and merged using entity identification, linking, and matching techniques. By constructing a global entity index and consistency check mechanism, the uniqueness and consistency of entity identities in the whole event collection and integration process are ensured.
The quality and purity of the data are further improved through a fine deduplication strategy, and a more reliable event information basis is provided for users.
(4) And the multidimensional event verification and information enhancement module is used for verifying and enhancing the target event information so as to obtain an event view.
Specifically, in order to ensure the accuracy and authority of the target event information, a comprehensive entity verification process is performed.
The process merges an external authoritative knowledge base (e.g., Wikipedia, DBpedia, etc.), expert systems, and advanced automated context analysis techniques. Through multiple verification and comparison of the identified entities, the false identification rate is effectively reduced, and the credibility of the information is improved.
Meanwhile, based on a context analysis technology, key attribute information of the entity, such as time, place, participants and the like, can be automatically supplemented or corrected, so that the event description is more vivid, detailed and accurate. This functionality not only enhances the readability and value of the information, but also provides a deeper event insight to the user.
And fourthly, the performance evaluation and optimization module continuously optimizes the system through data feedback and algorithm iteration.
The performance evaluation and optimization module comprises:
(1) And the multi-dimensional performance index collection module is used for collecting system performance indexes such as accuracy, recall rate, F1 score, robustness, expandability, accuracy, processing time, resource consumption and the like.
These metrics help characterize the performance of the system from different perspectives, so that more targeted optimizations can be made.
(2) And the automatic super-parameter adjustment module is used for searching the optimal configuration of the system by applying a super-parameter optimization technology based on machine learning.
Specifically, a machine learning based super-parametric optimization technique, in particular bayesian optimization (Bayesian Optimization), is applied to find the optimal system configuration in a huge parameter space.
Specifically, the optimal system configuration can be found by the following method:
In the category of smart search algorithms, grid search (Grid Search) follows a predefined strategy: it defines a grid over the parameter space and finds the optimal configuration by traversing all grid points. Formally, each dimension of the parameter vector θ = (θ₁, …, θₙ) is discretized, forming the Cartesian product space Θ = Θ₁ × Θ₂ × … × Θₙ.
Random search (Random Search) instead selects parameter points at random; the process can be regarded as sampling from a uniform distribution over the parameter space Θ. To overcome the sharp loss of efficiency that grid search and random search suffer as the dimensionality of the parameter space increases, genetic algorithms (Genetic Algorithms, GA) and simulated annealing (Simulated Annealing, SA) can be applied. GA iteratively improves a solution set of parameters by simulating the natural selection process (operations such as selection, crossover, and mutation); its core is a fitness function (usually tied to model performance) that evaluates the quality of each solution. SA simulates the energy reduction of a system during physical annealing and accepts non-optimal solutions with a certain probability, so that local optima can be escaped in pursuit of the global optimum.
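Grid search over a discretized Cartesian product and random search over a continuous space can be sketched as follows (the parameter names and objective function are illustrative, not taken from the patent):

```python
import itertools
import random

def grid_search(param_grid, objective):
    """Exhaustively evaluate the Cartesian product of discretized parameter values."""
    names = list(param_grid)
    best, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        config = dict(zip(names, values))
        score = objective(config)
        if score > best_score:
            best, best_score = config, score
    return best, best_score

def random_search(param_space, objective, n_iter=50, seed=0):
    """Sample configurations uniformly from continuous ranges given as (low, high)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_iter):
        config = {n: rng.uniform(lo, hi) for n, (lo, hi) in param_space.items()}
        score = objective(config)
        if score > best_score:
            best, best_score = config, score
    return best, best_score
```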
For Bayesian optimization, the core is to construct a Gaussian Process (GP) model f(θ) ~ GP(μ(θ), k(θ, θ′)), where μ(θ) is the mean function and k(θ, θ′) is the covariance function (also called the kernel function) that describes the similarity between different hyper-parameter configurations. Bayesian optimization selects the next evaluation point by maximizing an acquisition function (Acquisition Function) α(θ) (e.g., EI - Expected Improvement, PI - Probability of Improvement). The acquisition function balances exploration and exploitation; formally, EI can be written as:
EI(θ) = E[max(f(θ) − f*, 0) | D]
where D = {(θᵢ, f(θᵢ))} is the existing set of observed data and f* is the maximum model performance currently known.
As the iteration proceeds, new observation points (θ_new, f(θ_new)) are continuously added to the data set D and the GP model is updated, so that the model performance of unexplored regions is predicted more accurately and the search process is guided toward the optimal solution. This property enables Bayesian optimization to find a near-globally-optimal hyper-parameter configuration in a small number of evaluations.
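Under a Gaussian posterior N(μ, σ²) at a candidate point, EI has a well-known closed form, EI = (μ − f* − ξ)Φ(z) + σφ(z) with z = (μ − f* − ξ)/σ. A minimal sketch (the exploration margin ξ is an assumed extra parameter, not part of the patent's formulation):

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI for a GP posterior N(mu, sigma^2), relative to the best
    observed value f_best (maximization). xi trades off extra exploration."""
    if sigma <= 0.0:                      # degenerate posterior: deterministic prediction
        return max(mu - f_best - xi, 0.0)
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

The next hyper-parameter configuration is chosen by maximizing this function over the candidate set.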
(3) And the real-time monitoring and feedback module is used for monitoring performance indexes and health states of the system, such as memory use, CPU load and the like, so as to ensure the stable operation of the system.
(4) And the integrated A/B test module is used for dynamically adjusting the traffic distribution of the experimental group and the control group through a multi-armed bandit algorithm in the A/B test, so as to maximize the learning efficiency.
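The bandit-style traffic allocation in the A/B test module can be sketched with an epsilon-greedy strategy (one of several possible bandit algorithms; the specific choice here is illustrative):

```python
import random

class EpsilonGreedyAB:
    """Epsilon-greedy multi-armed bandit for A/B traffic allocation:
    mostly route traffic to the best-performing variant, explore with probability epsilon."""
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.counts = [0] * n_arms        # pulls per variant
        self.values = [0.0] * n_arms      # running mean reward per variant
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))                      # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])    # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean update
```

Over time, the better-performing variant receives the bulk of the traffic while a small exploration budget keeps the estimates fresh.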
(5) And the model version control and rollback module is used for realizing system version control and recording the detailed information updated each time so as to quickly rollback to the previous stable version when the new version is in problem.
(6) And the cross-platform performance evaluation module is used for evaluating the performance of the cross-platform by considering that different platforms possibly have different influences on the performance of the system, so that the system can keep the best performance in various environments.
(7) And the generalization capability test module is used for evaluating the generalization capability of the system by testing the system on different data sets, so that the system is ensured to perform well on training data and adapt to new and unseen data.
(8) And the continuous integration and continuous deployment (CI/CD) module is used for integrating the performance evaluation and optimization process into the CI/CD process, realizing automatic test, evaluation and deployment and improving the development efficiency and the system stability.
Through the optimization measures, the functions of each module in the system are more comprehensive, more personalized, efficient and accurate service can be provided, and meanwhile, the adaptability and the user friendliness of the system are enhanced.
Example 2:
as shown in fig. 2, the present embodiment provides a method for efficiently extracting cross-document information based on a large language model, which includes the following steps:
And S1, carrying out deep analysis and summarization on a plurality of documents in various formats through a large language model (such as Qwen) so as to acquire abstract information of each document.
And S11, preprocessing a plurality of documents in a plurality of formats to obtain text data of each document.
And S11-1, collecting and managing metadata of each document in a plurality of documents in a plurality of formats.
And S11-2, converting each document in the plurality of documents in the plurality of formats into text data, and verifying consistency of the text data.
S11-3, performing mean analysis and maximum and minimum length statistics on the text data by applying a machine learning algorithm to understand the text data;
and S11-4, thoroughly cleaning the text data based on the understood text data, and verifying the consistency of the text data.
And S11-5, storing the cleaned data in a file with a specific format.
S11-6, real-time safety protection measures are performed in the process of deep analysis and summarizing of a plurality of documents in various formats.
And S12, cutting the text data of each document into blocks to obtain a plurality of text blocks of each document.
S12-1, identifying a text data structure of each document to extract key content;
s12-2, dividing the text data into a plurality of structured text blocks by using a recursive algorithm based on the extracted key content;
And S12-3, automatically adjusting the size of the text block according to the length and the complexity of the text data.
And S12-4, applying a sliding window on the text block to acquire wider context information.
S12-5, dynamically adjusting the segmentation strategy based on the acquired context information.
And S12-6, carrying out multi-dimensional text analysis on the text block.
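Steps S12-2 through S12-5 — segmentation into blocks with a sliding window for wider context — can be sketched at the character level as follows (chunk size and overlap are illustrative parameters; the actual system adjusts them dynamically by length and complexity):

```python
def chunk_text(text, max_len=200, overlap=50):
    """Split text into overlapping windows; each window shares `overlap`
    characters with its predecessor so context spans chunk boundaries."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break                         # final chunk reaches the end of the text
        start += max_len - overlap        # slide the window forward
    return chunks
```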
And S13, extracting abstract information of each document based on a plurality of text blocks of each document.
And S13-1, constructing prompts to guide the large language model to concentrate on key information in the text.
Specifically, a prompt similar to "help me extract summary information in text" is constructed to guide the large language model to focus on key information in text. These cues are tailored to the text content and context to ensure that the model captures the most relevant summary information.
And S13-2, understanding the obtained context information to ensure continuity and correlation of abstract information and avoid faults or omission of the information.
S13-3, analyzing each text block of each document through complex semantic analysis and pattern recognition technology to recognize key information such as topic sentences, important data, main views and the like, and constructing a abstract based on the key information.
And S13-4, generating a preliminary abstract of each document based on the identified key information.
Wherein, in the course of generating the preliminary abstract of each document, the primary viewpoint and structure of the original text are maintained as much as possible, while redundant and secondary information is removed.
And S13-5, further optimizing the generated preliminary abstract to improve abstract continuity so as to generate an optimized abstract.
Specifically, the generated preliminary abstract is optimized by adjusting sentence order, adding transitional words or phrases, and the like, so as to ensure the logical fluency of the abstract.
And S13-6, searching a proper balance point between the length of the optimized abstract and the information density, and extracting abstract information of each text block based on the searched balance point.
Specifically, a proper balance point is sought so that the conciseness of the abstract is ensured while its completeness is preserved, that is, no key information is missed.
The step S13 further includes:
The text data of each document is translated into different languages to obtain abstract information of each document in different language environments, and evaluation is performed to optimize the abstract information based on the evaluation result.
And S14, information fusion is carried out on the extracted summary information of each document so as to generate cross-document summary information.
And S15, performing quality evaluation on the generated cross-document summary information and generating feedback information.
S2, extracting event information from the generated cross-document summary information through a large-scale language model and a supervised sequence labeling model.
And S21, training and fusing the large-scale language model and the supervised sequence labeling model.
And S21-1, collecting and sorting historical cross-document summary information containing rich event information.
S21-2, pre-training the large-scale language model by utilizing the collected history cross-document summary information.
S21-3, constructing a supervised sequence labeling model, and training the constructed supervised sequence labeling model through output data in a pre-training process of the large-scale language model.
S21-4, carrying out deep fusion on the pre-trained large-scale language model and the supervised sequence labeling model.
S22, extracting event information from the cross-document summary information through the fused large-scale language model and the supervised sequence labeling model.
S22-1, inputting cross-document summary information into the fused large-scale language model and the supervised sequence labeling model.
S22-2, preliminary processing is carried out on the cross-document summary information through a pre-trained large-scale language model, and potential event candidates are generated.
S22-3, further identifying and classifying event candidates through a pre-trained supervised sequence labeling model to extract complete event information.
S22-4, constructing an event information network.
S23, evaluating the performance of the fused large-scale language model and the supervised sequence labeling model based on the extracted event information.
And S3, based on the extracted event information, obtaining an event view through deep semantic analysis and vectorization characterization.
S31, learning context information in the extracted event information through a deep neural network by using a text embedding technology to generate high-dimensional and dense semantic vectors.
S32, based on the semantic vector, intelligent identification and efficient merging are carried out on the extracted event information through a cosine similarity calculation method and an advanced clustering algorithm so as to merge the event information.
S33, fusing and maintaining the merging event to obtain target event information;
and S34, verifying and enhancing the target event information to obtain an event view.
And S4, continuously optimizing the system through data feedback and algorithm iteration.
And S41, collecting system performance indexes to understand the system performance.
S42, applying a super-parameter optimization technology based on machine learning to find the optimal configuration of the system.
Example 3:
The present embodiment provides a computer readable storage medium, where the computer readable storage medium includes a stored program, and when the program runs, the device where the computer readable storage medium is controlled to execute the method for extracting high-efficiency cross-document information based on the large language model described in embodiment 2.
The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the terminal embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description in the method embodiment for relevant points.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, it should be noted that, in the description corresponding to the flowcharts or the block diagrams in the figures, the operations or steps corresponding to different blocks may also occur in different orders than that disclosed in the description, and sometimes, there is no specific order between the different operations or steps. For example, two consecutive operations or steps may actually be performed substantially in parallel, they may sometimes be performed in reverse order, which may be dependent on the functions involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The high-efficiency cross-document information extraction system based on the large language model is characterized by comprising an information compression module, an event extraction module and a cross-document event collection module, wherein the information compression module is used for carrying out deep analysis and summarization on a plurality of documents in various formats through the large language model so as to acquire cross-document summary information;
the event extraction module is used for extracting event information from the generated cross-document summary information through the large-scale language model and the supervised sequence labeling model;
the cross-document event collection module is used for obtaining an event view through deep semantic analysis and vectorization representation based on the extracted event information;
The information compression module comprises a data preprocessing module and a data processing module, wherein the data preprocessing module is used for carrying out data conversion, data cleaning, feature analysis, metadata management, consistency verification and safety measures on a plurality of documents in various formats.
2. The efficient cross-document information extraction system based on a large language model of claim 1, wherein the data preprocessing module comprises:
A metadata management unit for collecting and managing metadata of each of a plurality of documents in a plurality of formats;
A document format conversion unit for converting each of a plurality of documents in a plurality of formats into text data using a natural language processing tool;
the feature analysis unit is used for carrying out mean value analysis and maximum and minimum length statistics on the converted text data by applying a machine learning algorithm so as to understand the text data;
The data cleaning unit is used for thoroughly cleaning the text data based on the understood text data;
The data consistency verification unit is used for verifying consistency of the text data in the process of converting a plurality of documents in various formats into the text data and in the process of cleaning the stored text data;
and the data security measure implementation unit is used for implementing the data security measure.
3. The efficient cross-document information extraction system based on a large language model of claim 2, wherein the information compression module further comprises a text segmentation module, a abstract extraction module and an information fusion module;
the text segmentation module is used for cutting text data of each document to obtain a plurality of text blocks of each document;
The abstract extraction module is used for extracting abstract information of each document from a plurality of text blocks of each document;
the information fusion module is used for carrying out information fusion on the extracted summary information of each document so as to generate cross-document summary information.
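The segment-summarize-fuse flow of claim 3 could be sketched as follows, using a simple word-frequency scorer as a stand-in for the large language model summarizer (the scoring heuristic and function names are assumptions, not the patented method):

```python
# Stand-in for the abstract extraction and information fusion modules of claim 3.
# The frequency scorer is a hypothetical placeholder for the LLM summarizer.
from collections import Counter

def summarize(blocks: list[str]) -> str:
    """Pick the text block whose words are most frequent across the document."""
    freq = Counter(w.lower() for b in blocks for w in b.split())
    def score(block: str) -> float:
        words = block.split()
        return sum(freq[w.lower()] for w in words) / len(words)
    return max(blocks, key=score)

def fuse(summaries: list[str]) -> str:
    """Cross-document fusion: join per-document summaries, dropping duplicates."""
    seen, merged = set(), []
    for s in summaries:
        if s not in seen:
            seen.add(s)
            merged.append(s)
    return " ".join(merged)

summary = summarize(["the model extracts events", "the model", "unrelated words here"])
fused = fuse(["A.", "B.", "A."])
```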
4. The efficient cross-document information extraction system based on a large language model of claim 3, wherein the text segmentation module comprises:
the high-efficiency document loading and content extracting unit is used for identifying the text data structure of each document so as to extract key content;
A recursive character text segmentation unit for segmenting the text data into a plurality of structured text blocks using a recursive algorithm based on the extracted key contents;
The dynamic text block size adjusting unit is used for automatically adjusting the size of the text block according to the length and the complexity of the text data;
A sliding window technology application unit for applying a sliding window on the text block to obtain context information of the text block;
and the self-adaptive text segmentation unit is used for dynamically adjusting the segmentation strategy according to the semantic density and structural characteristics derived from the obtained context information of the text blocks.
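A minimal sketch of the recursive character segmentation and sliding-window context described in claim 4 (the separator hierarchy, block size and one-block overlap are assumed values; the patent does not specify them):

```python
# Assumed separators and sizes; the patent does not disclose concrete values.
def recursive_split(text, max_len=20, separators=("\n\n", "\n", ". ", " ")):
    """Recursive character segmentation: split on the coarsest separator first,
    recursing with finer separators until every block fits max_len."""
    if len(text) <= max_len or not separators:
        return [text]  # fallback: an unsplittable block is returned as-is
    sep, rest = separators[0], separators[1:]
    parts = [p for p in text.split(sep) if p.strip()]
    if len(parts) == 1:  # separator absent at this level: try a finer one
        return recursive_split(text, max_len, rest)
    blocks = []
    for part in parts:
        blocks.extend(recursive_split(part, max_len, rest))
    return blocks

def sliding_window(blocks):
    """Attach the previous block as context, approximating a one-block overlap."""
    return [(blocks[i - 1] if i > 0 else "", b) for i, b in enumerate(blocks)]

blocks = recursive_split("Alpha beta.\n\nGamma delta epsilon.\n\nZeta.")
windows = sliding_window(blocks)
```

The dynamic size-adjustment and self-adaptive units of claim 4 would tune `max_len` and the separator list per document; that logic is not disclosed.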
5. The efficient cross-document information extraction system based on a large language model of claim 2, wherein the information compression module further comprises a multi-language support module and a summary quality assessment module;
The multi-language support module is used for evaluating the summary information obtained from the text data of each document under different language environments, and optimizing the summary information based on the evaluation result;
The summary quality evaluation module is used for performing quality evaluation on the generated cross-document summary information and generating feedback information.
6. The efficient cross-document information extraction system based on a large language model of claim 1, wherein the event extraction module comprises a model training module, an event extraction module, and an event evaluation module;
the model training module is used for training and fusing the large-scale language model and the supervised sequence labeling model;
The event extraction module is used for extracting event information from the cross-document summary information through the fused large-scale language model and the supervised sequence labeling model;
the event evaluation module is used for evaluating the performance of the deep-fused large-scale language model and the supervised sequence labeling model based on the extracted event information.
7. The large language model based high-efficiency cross-document information extraction system of claim 1, wherein the cross-document event collection module comprises a deep semantic parsing and vectorization characterization module, an intelligent similarity evaluation and high-efficiency grouping module, a cross-document event fusion and consistency maintenance module, and a multidimensional event verification and information enhancement module;
The deep semantic analysis and vectorization characterization module is used for learning context information in the extracted event information through a deep neural network by utilizing a text embedding technology so as to convert the event information into high-dimensional and dense semantic vectors;
The intelligent similarity evaluation and high-efficiency grouping module is used for intelligently identifying and merging event information based on the converted semantic vectors through cosine similarity calculation and the DBSCAN clustering algorithm, so as to generate merged event information;
The cross-document event fusion and consistency maintenance module is used for fusing the merged event information and maintaining its consistency, so as to obtain target event information;
And the multidimensional event verification and information enhancement module is used for verifying and enhancing the target event information so as to obtain an event view.
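Claim 7 names cosine similarity and DBSCAN explicitly; below is a dependency-free illustration on toy semantic vectors (in practice a library implementation such as scikit-learn's would be used, and the `eps`/`min_pts` values here are assumptions):

```python
# Toy cosine-similarity DBSCAN over "semantic vectors"; eps/min_pts are assumed.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def dbscan(vectors, eps=0.2, min_pts=2):
    """Minimal DBSCAN using cosine distance (1 - similarity); -1 marks noise."""
    n = len(vectors)
    labels = [None] * n
    cluster = -1

    def neighbors(i):
        return [j for j in range(n)
                if 1.0 - cosine_sim(vectors[i], vectors[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                 # expand the cluster from core points
            j = queue.pop()
            if labels[j] == -1:      # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)
    return labels

# Two tight event groups plus one outlier event vector:
labels = dbscan([[1, 0], [0.95, 0.05], [0, 1], [0.05, 0.95], [0.7, 0.7]])
```

Each cluster label would correspond to one merged event; noise points (-1) remain unmerged single-document events.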
8. A high-efficiency cross-document information extraction method based on a large language model, characterized by comprising the following steps:
Step one, performing deep analysis and summarization on a plurality of documents in various formats through a large language model to obtain summary information of each document, wherein the deep analysis and summarization comprises performing data conversion, data cleaning, feature analysis, metadata management, consistency verification and security measures on the plurality of documents in various formats;
Step two, extracting event information from the generated cross-document summary information through a large-scale language model and a supervised sequence labeling model;
Step three, based on the extracted event information, obtaining an event view through deep semantic analysis and vectorization characterization;
and fourthly, continuously optimizing the system through data feedback and algorithm iteration.
9. The method for efficient cross-document information extraction based on a large language model as claimed in claim 8, wherein said step one includes:
preprocessing a plurality of documents in a plurality of formats to obtain text data of each document;
cutting the text data of each document to obtain a plurality of text blocks of each document;
Extracting summary information of each document from a plurality of text blocks of each document;
and carrying out information fusion on the extracted summary information of each document to generate cross-document summary information.
10. A computer readable storage medium, wherein the computer readable storage medium comprises a stored program, and wherein, when the program runs, a device in which the computer readable storage medium is located is controlled to perform the high-efficiency cross-document information extraction method based on a large language model according to any one of claims 8 to 9.
CN202411272956.7A 2024-09-12 2024-09-12 High-efficiency cross-document information extraction system and method based on large language model Active CN118780269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411272956.7A CN118780269B (en) 2024-09-12 2024-09-12 High-efficiency cross-document information extraction system and method based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411272956.7A CN118780269B (en) 2024-09-12 2024-09-12 High-efficiency cross-document information extraction system and method based on large language model

Publications (2)

Publication Number Publication Date
CN118780269A CN118780269A (en) 2024-10-15
CN118780269B true CN118780269B (en) 2025-02-18

Family

ID=92988137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411272956.7A Active CN118780269B (en) 2024-09-12 2024-09-12 High-efficiency cross-document information extraction system and method based on large language model

Country Status (1)

Country Link
CN (1) CN118780269B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119358528A (en) * 2024-10-28 2025-01-24 乐山鑫讯科技有限公司 ERP intelligent document management method and system based on natural language processing
CN119046456A (en) * 2024-10-30 2024-11-29 国投融合科技股份有限公司 Report generation method, device, medium and product based on multi-agent large model
CN119782503B (en) * 2024-12-16 2025-08-29 广州规律未来智能科技有限公司 A method and system for automatic document structuring processing based on LLM
CN119494394B (en) * 2025-01-17 2025-07-22 北京索云科技股份有限公司 AI-driven multilingual intelligent document abstract and thinking guide diagram generation system
CN119599612B (en) * 2025-02-10 2025-04-15 华中科技大学 Incremental time line engineering information extraction method and system based on block chain

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591483A (en) * 2021-04-27 2021-11-02 重庆邮电大学 Document-level event argument extraction method based on sequence labeling
CN118535710A (en) * 2024-07-09 2024-08-23 山东浪潮科学研究院有限公司 Text processing method and device for improving information retrieval and generation quality and computer system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004025490A1 (en) * 2002-09-16 2004-03-25 The Trustees Of Columbia University In The City Of New York System and method for document collection, grouping and summarization
WO2021243706A1 (en) * 2020-06-05 2021-12-09 中山大学 Method and apparatus for cross-language question generation
US20240242037A1 (en) * 2023-01-13 2024-07-18 Casetext, Inc. Generative text model interface system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591483A (en) * 2021-04-27 2021-11-02 重庆邮电大学 Document-level event argument extraction method based on sequence labeling
CN118535710A (en) * 2024-07-09 2024-08-23 山东浪潮科学研究院有限公司 Text processing method and device for improving information retrieval and generation quality and computer system

Also Published As

Publication number Publication date
CN118780269A (en) 2024-10-15

Similar Documents

Publication Publication Date Title
CN118780269B (en) High-efficiency cross-document information extraction system and method based on large language model
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110826337B (en) A Short Text Semantic Training Model Acquisition Method and Similarity Matching Algorithm
CN117574858A (en) Automatic generation method of class case retrieval report based on large language model
CN115630843A (en) Contract clause automatic checking method and system
CN115329765A (en) Method and device for identifying risks of listed enterprises, electronic equipment and storage medium
WO2024248731A1 (en) Method and apparatus for multi-label text classification
CN114676346A (en) News event processing method and device, computer equipment and storage medium
Zheng et al. Pretrained domain-specific language model for general information retrieval tasks in the AEC domain
CN119962536A (en) Entity extraction method and system for hydropower station data
CN115221871A (en) Keyword Extraction Method of English Sci-tech Documents Based on Multi-feature Fusion
CN119760082A (en) Enhanced question-answering method integrating context awareness
CN118350368B (en) Multi-document select and edit method of large language model based on NLP technology
CN118779458A (en) A sensitive information analysis and identification method, system, device and readable storage medium
Ganesh et al. An overview of semantic based document summarization in different languages
CN117891948A (en) Small sample news classification method based on internal knowledge extraction and contrast learning
Safikhani et al. Enhancing autonlp with fine-tuned BERT models: an evaluation of text representation methods for autopytorch
Cong et al. Named entity recognition for power data based on lexical enhancement and global pointer
CN118069853A (en) Intelligent auxiliary system for writing text and construction method
Patil et al. Automatic devanagari text summarization for youtube videos
CN119474380B (en) A conflict and dispute event early warning method, system, program product and storage medium
Gao A combined rule-based and machine learning approach for blackout analysis using natural language processing
Yu et al. Semantic Similarity Computing for Scientific Academic Conferences fused with domain features
Feng Research on Overseas Live Commentary Model Based on Computer Natural Language Intelligent Processing Technology
Gong Design of English Document Automatic Extraction System based on Long Short-Term Memory (LSTM) Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant