Disclosure of Invention
The present application solves the technical problems in the prior art that, during OCR text conversion, key and detailed information of a document is lost and not all texts in a PDF paper can be accurately identified, so that the accuracy and integrity of metadata extraction and identification are poor, and achieves the technical effect of improving the accuracy and integrity of the metadata extraction result of the PDF paper.
The present application provides a PDF paper metadata structured analysis method based on a multi-modal large model, which comprises the following steps: collecting a first PDF paper for visual analysis and generating first PDF image information; identifying the first PDF image information according to text extraction requirements and generating a metadata extraction prompt; constructing an initial multi-modal large model, synchronizing the first PDF image information and the metadata extraction prompt to the initial multi-modal large model, and outputting first initial metadata of the first PDF paper; collecting N PDF papers based on a plurality of information sources for visual analysis, generating N PDF image information, and defining a plurality of data texts according to the text extraction requirements to generate a metadata extraction prompt set, wherein N is an integer greater than 1; synchronizing the N PDF image information and the metadata extraction prompt set to the initial multi-modal large model, and outputting N initial metadata of the N PDF papers; adjusting the initial multi-modal large model based on the N PDF papers and the N initial metadata to generate a multi-modal large model; and verifying the first initial metadata by using the multi-modal large model in combination with the first PDF paper, and outputting first metadata of the first PDF paper.
In a possible implementation manner, the first PDF paper is further generated by the following processing: traversing the plurality of information sources to perform random selection and generate a plurality of PDF documents; performing virtual generation according to the plurality of PDF documents to generate a plurality of virtual metadata and a plurality of first-page texts, wherein the plurality of virtual metadata and the plurality of first-page texts have corresponding relations with the plurality of PDF documents; aligning the plurality of virtual metadata and the plurality of first-page texts according to the plurality of PDF documents; and traversing the alignment results to perform random extraction, thereby generating the first PDF paper.
In a possible implementation manner, collecting a first PDF paper for visual analysis, generating first PDF image information, identifying the first PDF image information according to text extraction requirements, and generating a metadata extraction prompt comprise the following processing: performing page conversion on a first document page of the first PDF paper to generate a first document page image; performing image standardization processing on the first document page image to generate the first PDF image information; performing feature extraction on the first PDF image information by means of computer vision technology and acquiring the text extraction requirements; traversing the first PDF image information according to the text extraction requirements to generate labels to be extracted; and determining the metadata extraction prompt according to the labels to be extracted.
In a possible implementation manner, constructing the initial multi-modal large model comprises the following processing: constructing a visual encoder architecture by using a convolutional neural network, and training the visual encoder architecture in combination with the first PDF image information to determine a visual encoder; preprocessing based on the metadata extraction prompt to generate a word embedding sequence; constructing a language encoder architecture, and training the language encoder architecture in combination with the word embedding sequence to determine a language encoder; cooperatively connecting the visual encoder and the language encoder to construct a model component; and invoking LLaVA architecture processing in combination with the model component to construct the initial multi-modal large model.
In a possible implementation manner, synchronizing the first PDF image information and the metadata extraction prompt to the initial multi-modal large model and outputting the first initial metadata of the first PDF paper comprise the following processing: inputting the first PDF image information to the visual encoder for feature extraction to generate a visual feature vector; inputting the metadata extraction prompt to the language encoder for feature extraction to generate a text feature vector; performing joint encoding on the visual feature vector and the text feature vector to generate a fusion feature; and performing decoding processing based on the fusion feature to generate the first initial metadata.
In a possible implementation manner, before the visual feature vector and the text feature vector are jointly encoded, the following processing is further executed: aligning the visual feature vector and the text feature vector to generate an image-text pair; introducing a contrastive loss function to calculate the image-text pair and determine vector distance data; judging whether the vector distance data is within a vector distance threshold interval, and if so, generating a matching signal and comparing the visual feature vector and the text feature vector according to the matching signal; when the visual feature vector and the text feature vector are consistent, the matching signal is enhanced, and when the visual feature vector and the text feature vector are inconsistent, the matching signal is reduced.
In a possible implementation manner, jointly encoding the visual feature vector and the text feature vector to generate a fusion feature further comprises the following processing: dividing a plurality of visual feature subareas based on the visual feature vector, wherein the plurality of visual feature subareas comprise visual tokens; performing attention calculation based on the visual tokens to generate a plurality of visual attention weights; dividing a plurality of text feature subareas based on the text feature vector, wherein the plurality of text feature subareas comprise text tokens; performing attention calculation based on the text tokens to generate a plurality of text attention weights; constructing a plurality of association relations between the visual feature vector and the text feature vector according to the plurality of visual attention weights and the plurality of text attention weights based on the matching signal; and jointly encoding the visual feature vector and the text feature vector based on the plurality of association relations to generate the fusion feature.
In a possible implementation manner, adjusting the initial multi-modal large model based on the N PDF papers and the N initial metadata to generate the multi-modal large model further comprises the following processing: performing data enhancement on the N PDF papers and the N initial metadata to construct a training data set and a verification data set; performing metadata extraction mode training by using the training data set to generate a plurality of training results; performing cross verification on the plurality of training results by using the verification data set; extracting the training results that pass verification to dynamically adjust the initial multi-modal large model and generate an adjustment result; and introducing an evaluation index to evaluate the adjustment result, and generating the multi-modal large model according to adjustment evaluation information.
In a possible implementation manner, introducing an evaluation index to evaluate the adjustment result and generating the multi-modal large model according to the adjustment evaluation information further comprise the following processing: setting metadata extraction targets based on the N PDF papers and the N initial metadata; performing extraction training based on the metadata extraction targets, testing according to the training results, and calculating a test passing rate; setting an extraction quality score according to the test passing rate, and formulating the evaluation index according to the quality score; judging, according to the evaluation index, whether adjustment data smaller than the evaluation index exists in the adjustment result; if adjustment data smaller than the evaluation index exists in the adjustment result, making marking records for the adjustment data smaller than the evaluation index and adding them to a data set to be verified; and screening the adjustment result based on the data set to be verified to generate the adjustment evaluation information.
According to the PDF paper metadata structured analysis method based on the multi-modal large model provided by the present application, visual analysis is performed on the first PDF paper to generate the first PDF image information, and the first PDF image information is identified according to the text extraction requirements to generate the metadata extraction prompt; then the initial multi-modal large model is constructed, the first PDF image information and the metadata extraction prompt are synchronized to the initial multi-modal large model, and the first initial metadata of the first PDF paper is output; visual analysis is performed on the N PDF papers based on a plurality of information sources to generate the N PDF image information, and a plurality of data texts are defined according to the text extraction requirements to generate the metadata extraction prompt set; then the N PDF image information and the metadata extraction prompt set are synchronized to the initial multi-modal large model, and the N initial metadata of the N PDF papers are output; the initial multi-modal large model is adjusted based on the N PDF papers and the N initial metadata; finally the first initial metadata is verified by using the multi-modal large model in combination with the first PDF paper, and the first metadata of the first PDF paper is output. In this way, key and detailed information of the document is not lost in the analysis process, all texts in the PDF paper can be accurately identified, and the accuracy and integrity of the metadata extraction result of the PDF paper are improved.
Detailed Description
The foregoing description is only an overview of the technical solution of the present application. In order that the technical means of the present application may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present application more readily apparent, the following detailed description is provided.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it is to be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict. The term "first/second" is used merely to distinguish between similar objects and does not represent a particular ordering of the objects. The terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article or apparatus. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The terminology used herein is for the purpose of describing embodiments of the application only.
The embodiment of the application provides a PDF paper metadata structured analysis method based on a multi-modal large model, which, as shown in figure 1, comprises the following steps:
Step S100, a first PDF paper is collected for visual analysis, first PDF image information is generated, the first PDF image information is identified according to text extraction requirements, and a metadata extraction prompt is generated. A target PDF paper is collected, or a PDF paper (containing, for example, a plurality of elements such as texts, figures, tables and formulas) is randomly obtained, to be used as the first PDF paper. The first PDF paper is then subjected to visual analysis using computer technology; specifically, the texts, figures, tables and the like on the first page of the first PDF paper are recognized and analyzed to generate a first PDF image, where the first PDF image is the first-page image of the first PDF paper. The first-page image is then identified according to the text extraction requirements, where the text extraction requirements include extracting detailed information such as the title, authors, abstract and keywords of the paper, as well as font size, relative text position and layout; that is, the corresponding texts in the first PDF image information are marked according to the specific text extraction requirements, for example, the text areas to be extracted are marked with annotation tools (such as rectangular boxes or polygons). Finally, the metadata extraction prompt is generated. Metadata is structured information that describes data and is generally used to provide additional information about the data; in academic papers, metadata includes the title, authors, abstract, keywords and so on. Generating a metadata extraction prompt, which helps the model extract the corresponding data and detailed information from the PDF paper more efficiently, means automatically generating or suggesting which metadata should be extracted based on the identified text regions.
In a possible implementation manner, step S100 further includes step S101 of traversing the multiple information sources to perform random selection to generate multiple PDF documents, performing virtual generation according to the multiple PDF documents to generate multiple virtual metadata and multiple top page texts, where the multiple virtual metadata and the multiple top page texts have corresponding relations with the multiple PDF documents, aligning the multiple virtual metadata and the multiple top page texts according to the multiple PDF documents, performing random extraction on the traversing alignment result, and generating the first PDF paper.
Preferably, the first PDF paper is generated based on metadata of massive PDF data synthesized from a small number of publishers' context patterns. Specifically, a small number of publishers are determined from the plurality of information sources by random selection, the context patterns of these publishers are called, PDF documents and their data are synthesized with the called patterns, and virtual generation is carried out for each PDF document; the PDF documents may contain various academic papers, reports and the like. That is, a large number of PDFs following a small number of patterns, together with their corresponding metadata, are simulated at low cost through a large model such as GPT-4, generating virtual metadata and first-page texts for the plurality of PDF documents, where the virtual metadata includes the title, authors, abstract, keywords, publishing information and the like of each paper, and each PDF document uniquely corresponds to one piece of virtual metadata and one first-page text. The plurality of virtual metadata and the plurality of first-page texts are then aligned according to the plurality of PDF documents, that is, the corresponding virtual LaTeX texts are generated by using the layout templates provided by the publishers, and the PDF papers obtained therefrom serve as the alignment results. Finally, the alignment results are traversed and one PDF paper is randomly extracted to be used as the first PDF paper.
In one possible implementation manner, as shown in fig. 2, step S100 further includes step S110, performing page conversion on the first page of the document in the first PDF paper, to generate a first page image of the document. The step of performing page conversion on the first document top page of the first PDF paper refers to converting the first document top page in PDF format into an image format (such as JPEG, PNG, etc.), and generating a document top page image. And step S120, performing image standardization processing on the document top page image to generate first PDF image information. The image normalization process typically includes adjusting the size, resolution, color space, etc. of the top page image of the document to ensure consistency and comparability of the images, resulting in a first PDF image. Step S130, extracting features of the first PDF image information by using a computer vision technology, determining a plurality of image features, and acquiring text extraction requirements. Computer vision techniques (e.g., deep learning, image processing algorithms, etc.) are used to identify and analyze key features in the image, which may include font size features (e.g., font size of key information such as title, subtitle, author name, abstract, etc., often different from other text), font relative position features (e.g., relative position of elements such as author name, title, abstract, etc., on the page, author name may be under the title, abstract may be under the author name and typically be at a certain vertical distance from the title and author name), typesetting layout features (e.g., paragraph distribution, column layout, table position, image embedding manner, etc.), by analyzing the above image features, it may be inferred that text of which portions is the focus of metadata extraction, i.e., to obtain text extraction requirements, e.g., title is typically at the top of the page and fonts are larger, author name and unit may be under the title and smaller but fixed in position.
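As an illustrative, non-limiting sketch of steps S110 and S120, the page conversion and image standardization may be implemented, for example, with the pdf2image library; the choice of library, the 200 dpi rendering and the 224×224 target size are assumptions made here for illustration only:

```python
# Sketch of steps S110-S120: render the first document page as an image and
# standardize it. The library, dpi and target size are illustrative assumptions.
from pdf2image import convert_from_path  # requires the poppler utilities
import numpy as np

def first_page_image(pdf_path, target_size=(224, 224)):
    # Page conversion: render only the first page of the PDF as an image.
    pages = convert_from_path(pdf_path, dpi=200, first_page=1, last_page=1)
    page = pages[0].convert("RGB")        # unify the color space
    page = page.resize(target_size)       # unify resolution and size
    # Image standardization: scale pixel values into [0, 1].
    return np.asarray(page, dtype=np.float32) / 255.0

# first_pdf_image = first_page_image("paper.pdf")  # "paper.pdf" is a hypothetical file
```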
And step S140, traversing the first PDF image information according to the text extraction requirement for marking, and generating a label to be extracted. Text content in the PDF image is converted to a processable format (e.g., text character strings or text blocks) using image processing techniques, and the first PDF image information is traversed according to text extraction requirements to find and identify potential text information, generating tags to be extracted for determining which text is extracted, e.g., tags such as "titles", "authors", "summaries", etc. And step S150, determining the metadata extraction prompt according to the label to be extracted. The method comprises the steps of determining metadata extraction prompts by using generated tags to be extracted, namely, determining metadata extraction prompts which are used for prompting how to effectively extract corresponding metadata from an original document (such as a PDF file), specifically, formulating metadata extraction strategies, comprising determining specific positions of the metadata in the document by using text position features (such as font relative position features), identifying and extracting the metadata by text content matching (such as keyword searching and regular expression matching), extracting the metadata by analyzing a document structure for a structured document (such as a PDF with an HTML tag), and the like, and generating and determining metadata extraction prompts according to the metadata extraction strategies, wherein the metadata extraction prompts comprise position prompts (page positions or text areas where the metadata possibly appear), format prompts (sizes, formats and the like of words), content prompts (keywords or phrases possibly contained by the metadata) and the like.
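As a non-limiting example of steps S140 and S150, the labels to be extracted can be turned into a textual metadata extraction prompt that combines position, format and content hints; the field names and hint wording below are assumptions for illustration:

```python
# Sketch of steps S140-S150: build a metadata extraction prompt from the
# labels to be extracted. Field names and hint texts are assumptions.
LABELS_TO_EXTRACT = ["title", "authors", "abstract", "keywords"]

HINTS = {
    "title":    "usually at the top of the first page, in the largest font",
    "authors":  "directly below the title, smaller font at a fixed position",
    "abstract": "below the authors, often introduced by the word 'Abstract'",
    "keywords": "below the abstract, often introduced by 'Keywords:'",
}

def build_metadata_prompt(labels):
    lines = ["Extract the following metadata fields from the first-page image:"]
    for label in labels:
        lines.append(f"- {label}: {HINTS.get(label, 'locate by layout and font cues')}")
    lines.append("Return the result as JSON with one key per field.")
    return "\n".join(lines)

# metadata_prompt = build_metadata_prompt(LABELS_TO_EXTRACT)
```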
Step S200, an initial multi-modal large model is built, the first PDF image information and the metadata extraction prompt are synchronized to the initial multi-modal large model, and first initial metadata of the first PDF paper is output. A visual encoder architecture is constructed based on a convolutional neural network and trained using the first PDF image information as a training set; in the training process the architecture learns how to automatically extract useful visual information from PDF images, so as to obtain a visual encoder. A language encoder architecture is constructed and trained using the metadata extraction prompt as training data; in the training process the architecture learns how to extract useful language information from the metadata extraction prompt sequence, so as to obtain a language encoder. The constructed visual encoder, the language encoder and the connection mechanism between them are integrated by using the LLaVA architecture, thereby constructing the initial multi-modal large model, which can simultaneously process data of multiple modalities (such as images and texts) and, by integrating information of different modalities, achieve more comprehensive understanding and processing capability. For PDF metadata extraction, the multi-modal large model simultaneously processes the document image and the metadata extraction prompt and outputs metadata; specifically, the first PDF image information and the metadata extraction prompt are synchronized to the initial multi-modal large model, and the first initial metadata of the first PDF paper is output, including the title, authors, abstract, keywords and the like of the first PDF paper.
In a possible implementation manner, step S200 further includes step S210, constructing a visual encoder architecture using a convolutional neural network, and training the visual encoder architecture in combination with the first PDF image information to determine a visual encoder. A convolutional neural network (CNN) is used to construct a visual encoder capable of processing image data; its feature extraction network typically comprises only convolutional and pooling layers (or convolutional layers with a stride greater than 1). The visual encoder architecture is trained using the first PDF image as training data, and during training it learns how to identify and extract visual features useful for metadata extraction, i.e., high-level features of the PDF image information (which may already have been converted to an image format suitable for processing), such as font size, typesetting layout and so on. Step S220, preprocessing is performed based on the metadata extraction prompt, and a word embedding sequence is generated. Preprocessing the text data of the metadata extraction prompt may include text cleaning, word segmentation, removal of stop words and the like; the text is then converted into vector representations, i.e., a word embedding sequence, using word embedding techniques (e.g., Word2Vec, GloVe or BERT embedding layers) to capture semantic relationships between words.
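As a minimal, non-limiting sketch of step S210, a small convolutional visual encoder mapping a first-page image to a visual feature vector could look as follows; the layer sizes and the 512-dimensional output are assumptions, and a production encoder would normally be a larger pretrained backbone:

```python
# Sketch of step S210: a small CNN visual encoder. Layer sizes are assumptions.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    def __init__(self, feature_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),               # pool to one spatial cell
        )
        self.proj = nn.Linear(64, feature_dim)      # map to the feature space

    def forward(self, image):                       # image: (B, 3, H, W)
        x = self.features(image).flatten(1)         # (B, 64)
        return self.proj(x)                         # visual feature vector (B, feature_dim)

# visual_vec = VisualEncoder()(torch.randn(1, 3, 224, 224))
```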
Step S230, a language encoder architecture is constructed by adopting a recurrent neural network, and the language encoder architecture is trained in combination with the word embedding sequence to determine a language encoder. A recurrent neural network is used to construct an encoder that receives word embedding sequences as input and outputs encoded language features. Word embedding techniques capture semantic relationships between words so that similar words are closer in the vector space. The text data (such as text extracted from the PDF) is preprocessed, for example by word segmentation and removal of stop words; each word is converted into a corresponding vector representation by the word embedding technique, and the vectors are arranged in the order of the words in the text to form the word embedding sequence. The language encoder architecture is trained with the word embedding sequence: at each time step it receives one word embedding vector and updates its internal state to capture the information in the sequence. After sufficient training iterations, the language encoder is determined, having learned how to convert word embedding sequences into useful language feature representations.
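A corresponding non-limiting sketch of steps S220 and S230, assuming an illustrative vocabulary size and a single-layer GRU as the recurrent unit:

```python
# Sketch of steps S220-S230: embed prompt tokens and encode them with a
# recurrent network. Vocabulary size and dimensions are assumptions.
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=256, feature_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word embedding sequence
        self.rnn = nn.GRU(embed_dim, feature_dim, batch_first=True)

    def forward(self, token_ids):                  # token_ids: (B, T)
        embedded = self.embedding(token_ids)       # (B, T, embed_dim)
        _, hidden = self.rnn(embedded)              # final state summarizes the prompt
        return hidden.squeeze(0)                    # text feature vector (B, feature_dim)

# text_vec = LanguageEncoder()(torch.randint(0, 30000, (1, 32)))
```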
And step S240, the visual encoder and the language encoder are cooperatively connected to construct a model component. The visual encoder for processing visual information and the language encoder for processing language information are combined through a specific mechanism to realize interaction and fusion of the visual information and the language information, for example, a double-flow attention mechanism is adopted by a VisualBERT model, images and texts are respectively input into the visual encoder and the language encoder and are interacted through a multi-layer attention mechanism, the images and the texts are subjected to preliminary processing in the respective encoders and then are interacted through a cross-modal attention layer, specifically, the input images and texts are respectively subjected to embedded representation, for the images, a pre-trained convolutional neural network is generally used for extracting features and converting the features into embedded vectors, for the texts, word embedding and position coding are used for representing text sequences, then the embedded representation is respectively input into the visual encoder and the language encoder, preliminary feature extraction and coding are carried out, and interaction and information transmission between the images and the texts are realized through the cross-modal attention mechanism.
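As a non-limiting sketch of the cooperative connection in step S240, a cross-modal attention layer can let text features attend to visual features; the use of nn.MultiheadAttention and the dimensions below are assumptions for illustration:

```python
# Sketch of step S240: cross-modal attention connecting the two encoders.
import torch
import torch.nn as nn

class CrossModalConnector(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, visual_feats):
        # text_feats: (B, T, dim) act as queries; visual_feats: (B, V, dim) as keys/values.
        fused, attn_weights = self.cross_attn(text_feats, visual_feats, visual_feats)
        return fused, attn_weights

# fused, w = CrossModalConnector()(torch.randn(1, 16, 512), torch.randn(1, 49, 512))
```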
Step S250, invoking LLaVA architecture processing, and constructing the initial multi-mode large model by combining the model components. LLaVA is an open-source multi-modal large model designed to connect a visual encoder and a Large Language Model (LLM) together to achieve general visual and language understanding, to be able to process visual information (images) of PDF documents and to effectively perform accurate generation of metadata text, specifically, to integrate the visual encoder and the language encoder together through a specific connection mechanism to form a unified multi-modal model, to perform end-to-end training on the model using multi-modal instruction following dataset (e.g., dataset containing images and corresponding text descriptions), during which the model learns how to align and fuse visual features and language features to better accomplish multi-modal tasks, and finally to obtain the initial multi-modal large model.
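In the spirit of the LLaVA design, the connection mechanism of step S250 can be sketched as a projection that maps visual encoder outputs into the language model's embedding space so that image tokens and prompt tokens share one input sequence; the dimensions, the plain linear projector and the generic language_model module below are assumptions, not the actual LLaVA implementation:

```python
# Sketch of step S250: project visual features into the LLM embedding space
# and prepend them to the prompt embeddings. Dimensions are assumptions;
# language_model is any module accepting a (B, T, llm_dim) embedding sequence.
import torch
import torch.nn as nn

class MultiModalModel(nn.Module):
    def __init__(self, visual_encoder, language_model, visual_dim=512, llm_dim=4096):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.projector = nn.Linear(visual_dim, llm_dim)   # vision -> LLM token space
        self.language_model = language_model

    def forward(self, image, prompt_embeddings):
        visual_token = self.projector(self.visual_encoder(image))         # (B, llm_dim)
        inputs = torch.cat([visual_token.unsqueeze(1), prompt_embeddings], dim=1)
        return self.language_model(inputs)  # metadata text generation is delegated to the LLM
```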
In a possible implementation manner, step S200 further includes step S260, where the first PDF image information is input to the visual encoder to perform feature extraction, so as to generate a visual feature vector. The first PDF image information is input into the visual encoder, which performs feature extraction on the image through a multi-layer convolutional neural network: convolutional layers extract local features of the image layer by layer, such as edges, corner points and textures; pooling layers downsample the feature maps, reducing their dimensions while retaining the main features; and a fully connected layer flattens the pooled feature maps and maps them to a high-dimensional feature space, generating a visual feature vector, which is a point in the high-dimensional space representing the key information in the image. Step S270, the metadata extraction prompt is input into the language encoder for feature extraction, and a text feature vector is generated. The metadata extraction prompt is input into the language encoder, which performs word segmentation, embedding and the like on the text and applies structures such as a Transformer to capture the semantic and contextual information in the text, finally converting this information into a text feature vector; this vector is also a point in a high-dimensional space and represents the key information in the text.
Step S280, the visual feature vector and the text feature vector are jointly encoded to generate a fusion feature. Jointly encoding the visual feature vector and the text feature vector typically involves concatenating the two vectors, or fusing them through some form of cross-modal attention mechanism; the purpose is to generate a fusion feature that contains the key information of both the image and the text and is interrelated in the vector space. Step S290, decoding processing is performed based on the fusion feature, and the first initial metadata is generated. In the decoding process, the decoder receives the fusion feature as input and applies a series of transformations (e.g., linear layers, softmax, etc.) to map it to the final text content output, generating the first initial metadata, i.e., structured initial metadata such as the author names, title and abstract.
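As a non-limiting sketch of steps S280 and S290, the simplest fusion strategy is concatenation of the two vectors followed by a decoding head; the concatenation choice, hidden size and vocabulary size are assumptions:

```python
# Sketch of steps S280-S290: fuse by concatenation and decode one token step.
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    def __init__(self, visual_dim=512, text_dim=512, hidden=512, vocab_size=30000):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + text_dim, hidden)   # joint encoding
        self.decode = nn.Linear(hidden, vocab_size)             # map to token logits

    def forward(self, visual_vec, text_vec):
        fused = torch.tanh(self.fuse(torch.cat([visual_vec, text_vec], dim=-1)))
        probs = self.decode(fused).softmax(dim=-1)   # one decoding step over the vocabulary
        return fused, probs                           # fusion feature, token probabilities

# fused, probs = FusionDecoder()(torch.randn(1, 512), torch.randn(1, 512))
```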
In a possible implementation manner, as shown in fig. 3, step S200 further includes a step S201 of aligning the visual feature vector with the text feature vector to generate an image-text pair, a step S202 of introducing a contrast loss function to calculate the image-text pair to determine vector distance data, a step S203 of determining whether the vector distance data is in a vector distance threshold interval, if so, generating a matching signal, and comparing the visual feature vector with the text feature vector according to the matching signal, and a step S204 of enhancing the matching signal when the visual feature vector is consistent with the text feature vector, and if not, cutting down the matching signal.
Preferably, the visual feature vector is a high-dimensional vector extracted from the image by image processing techniques (such as a convolutional neural network, CNN), and the text feature vector is a vector extracted from the text by natural language processing (NLP) techniques (such as word embedding or BERT). The image and text of each PDF document are aligned to form an image-text pair, and each pair of data should come from the first page of the same PDF document (although this cannot be fully guaranteed, i.e., it cannot be determined with certainty that the image and the text actually come from the same document, so subsequent contrastive training is required). The contrastive loss function is a loss function commonly used to learn the similarity or difference between samples and is used here to quantify the similarity between the image and the text; the smaller the vector distance data, the more similar the image and the text are in the feature space. The goal is to make the image and text from the same PDF document as similar as possible (smaller distance) and the image and text from different PDF documents as dissimilar as possible (larger distance). A vector distance threshold interval is set, and it is judged whether the vector distance data falls within it. When the vector distance data falls within the threshold interval, a matching signal is generated and the visual feature vector and the text feature vector are compared according to the matching signal; when the visual feature vector and the text feature vector are consistent, the matching signal is positive and is enhanced in the positive direction. When the vector distance data does not fall within the set vector distance threshold interval, the matching signal is negative, i.e., when the consistency between the visual feature vector and the text feature vector is weaker, the matching signal is reduced in the negative direction.
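A minimal sketch of the distance and matching-signal logic of steps S201 to S204, assuming a cosine-distance formulation and illustrative threshold values (a full contrastive training loop over positive and negative pairs is omitted):

```python
# Sketch of steps S201-S204: vector distance of an aligned image-text pair
# and a matching signal that is enhanced or reduced against a threshold
# interval. The cosine distance and threshold values are assumptions.
import torch
import torch.nn.functional as F

def matching_signal(visual_vec, text_vec, low=0.0, high=0.5):
    # Vector distance data: cosine distance between the aligned pair.
    distance = 1.0 - F.cosine_similarity(visual_vec, text_vec, dim=-1)
    inside = (distance >= low) & (distance <= high)   # within the threshold interval?
    # Positive (enhanced) signal for consistent pairs, negative (reduced) otherwise.
    signal = torch.where(inside, 1.0 - distance, -distance)
    return distance, signal

# d, s = matching_signal(torch.randn(4, 512), torch.randn(4, 512))
```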
In one possible implementation manner, the step S280 further includes a step S285 of dividing a plurality of visual feature subareas based on the visual feature vector, wherein the plurality of visual feature subareas comprise a visual token, a step S286 of performing attention calculation based on the visual token to generate a plurality of visual attention weights, a step S287 of dividing a plurality of text feature subareas based on the text feature vector, wherein the plurality of text feature subareas comprise a text token, a step S288 of performing attention calculation based on the text token to generate a plurality of text attention weights, a step S289 of constructing a plurality of association relations between the visual feature vector and the text feature vector according to the plurality of visual attention weights and the plurality of text attention weights based on a matching signal, and a step S2810 of jointly encoding the visual feature vector and the text feature vector based on the plurality of association relations to generate the fusion feature.
Preferably, the original visual feature vector (usually a high-dimensional, continuous vector) is divided into smaller, more easily processed subareas, each subarea containing a set of visual tokens. Specifically, in models such as the vision Transformer, visual tokens generally refer to the series of image blocks into which the image is divided; each image block is treated as one token and is mapped by a linear projection to the input dimension of the Transformer, yielding a series of image block vectors. Concretely, if the resolution of an image is H×W and the size of each divided image block is P×P, then (H/P)×(W/P), i.e., (H·W)/P², image blocks are obtained. These tokens are the basic units when the model processes the image and carry the local information and features of the image. Attention calculation is performed on each visual token to determine its importance in the overall visual representation, that is, the correlation between each visual token and the other visual tokens (or the overall visual representation) is calculated, and the resulting visual attention weights, which reflect the importance of each visual token, are used in the subsequent fusion.
Preferably, similar to the processing of visual feature vectors, the text feature vectors are also divided into a plurality of sub-regions, each sub-region containing a set of text tokens representing different words or phrases in the text, which are basic units of text information, which may be a word, a phrase, a punctuation mark, a subword (subword) or a character, in an NLP task, tokens are the essential basic units of the text, bear semantic and structural information, are the basis for language understanding and generation of a model, and the model continuously optimizes the representation of the tokens by dynamically adjusting parameters during training to realize understanding and generation of the text, and performs attention calculation on each text token to generate a set of text attention weights reflecting the importance of different parts in the whole text representation.
Preferably, based on the obtained matching signal, a plurality of association relations between the visual feature vector and the text feature vector are built according to the calculated visual attention weights and text attention weights. The association relations may be direct (such as correspondence based on spatial position), indirect (such as similarity based on semantic content) or a combination of both, and they help the model better understand the interaction and relation between the visual and text information in the joint encoding process. The visual feature vector and the text feature vector are then jointly encoded based on the built association relations, that is, the visual and text information is fused in the feature space, for example by concatenation of the feature vectors, weighted summation, or application of an attention mechanism, to generate a fusion feature containing the comprehensive information of both. The finally generated fusion feature is high-dimensional and simultaneously contains the key features of the visual and text information.
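A non-limiting sketch of steps S285 to S2810, assuming simple dot-product attention against each modality's mean token and an outer-product association matrix scaled by the matching signal; all shapes are illustrative:

```python
# Sketch of steps S285-S2810: per-token attention weights for both modalities,
# an association matrix scaled by the matching signal, and a fused feature.
import torch
import torch.nn.functional as F

def fuse_tokens(visual_tokens, text_tokens, matching_signal=1.0):
    # visual_tokens: (V, D) image-block features; text_tokens: (T, D) token features.
    v_weights = F.softmax(visual_tokens @ visual_tokens.mean(0), dim=0)  # (V,) visual attention
    t_weights = F.softmax(text_tokens @ text_tokens.mean(0), dim=0)      # (T,) text attention
    # Association relations between the modalities, scaled by the matching signal.
    association = matching_signal * torch.outer(v_weights, t_weights)    # (V, T)
    # Joint encoding: attention-pooled vectors from both modalities, concatenated.
    v_pooled = association.sum(1) @ visual_tokens                        # (D,)
    t_pooled = association.sum(0) @ text_tokens                          # (D,)
    return torch.cat([v_pooled, t_pooled], dim=-1)                       # fusion feature (2D,)

# fused = fuse_tokens(torch.randn(49, 512), torch.randn(16, 512))
```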
And step S300, acquiring N PDF papers based on a plurality of information sources for visual analysis, generating N PDF image information, defining a plurality of data texts according to the text extraction requirements to generate a metadata extraction prompt set, wherein N is an integer greater than 1. N PDF papers are collected from a plurality of information sources (such as academic databases, library resources, online archives and the like), including a large number of PDF papers of different publishers, then the N PDF papers are subjected to visual analysis to generate N PDF image information corresponding to the N PDF papers (namely front page images of the N papers), then a plurality of data texts of each PDF paper image information are identified according to preset text extraction requirements, including titles, authors, abstracts, keywords and the like of the N papers, so as to obtain a plurality of metadata extraction prompts, and the metadata extraction prompts form a metadata extraction prompt set.
Step S400, synchronizing the N PDF image information and the metadata extraction prompt set to the initial multi-mode large model, and outputting N initial metadata of the N PDF papers, wherein the N PDF papers and the N initial metadata have corresponding relations. And synchronizing the N PDF image information and the metadata extraction prompt set as input data to an initial multi-mode large model, wherein the initial multi-mode large model respectively uses an internal visual encoder and a language encoder to analyze and process the PDF image information and the metadata extraction prompt set, N initial metadata corresponding to N PDF papers are extracted, wherein the N metadata actually refer to N initial metadata sets, because the metadata of each PDF paper itself contains a plurality of text data such as titles, authors, abstracts, keywords and the like of the paper and is not single data, and the N PDF papers and the generated N initial metadata have a one-to-one correspondence.
Step S500, the initial multi-modal large model is adjusted based on the N PDF papers and the N initial metadata to generate a multi-modal large model. A data enhancement operation is performed on the N PDF papers and the corresponding metadata, and the enhanced data is used to construct a training data set and a verification data set; the training data set is input into the initial multi-modal large model, and the model is trained in the metadata extraction mode by using the training data set to obtain a plurality of training results; the plurality of training results are verified with the verification data set, and the initial multi-modal large model is dynamically adjusted according to the verification, for example by adding or modifying specific processing modules, adjusting the extraction strategies for visual and language data, and correcting errors in the extracted metadata, finally generating the multi-modal large model, so that the model can adapt to PDF papers of different formats and styles, better understand and process the document layout of complex PDF papers, and achieve high-precision and efficient metadata extraction.
In a possible implementation manner, step S500 further includes step S510, performing data enhancement on the N PDF papers and the N initial metadata, and constructing a training data set and a verification data set. Data enhancement is a technique for improving model generalization capability by increasing diversity and number of data samples, which may include extracting text content, performing text cleansing, converting format, adding noise, synonym substitution, etc., and N initial metadata, which may include modifying field order, adding or deleting non-critical fields, generating new metadata description, etc., to divide the enhanced data into training data sets and verification data sets. Step S520, training the metadata extraction mode by using the training data set, and generating a plurality of training results. Training of the metadata extraction mode is performed by using the training data set, so that key information in the PDF paper is identified and extracted as metadata, the training process may include a plurality of iterations, and a plurality of training results are generated in the training process.
Step S530, cross-validating the plurality of training results using the validation dataset. The multiple training results are cross-validated using the validation dataset, and the data is partitioned into a training set and a test set by repeating the partitioning multiple times, each time using a different partitioning to train the model and evaluate its performance, helping to reduce the risk of overfitting and providing a more reliable estimate of the model's performance. Step S540, extracting the training result which passes the verification to dynamically adjust the initial multi-mode large model, and generating an adjustment result. And selecting a training result with the best performance from the cross verification as a reference, and dynamically adjusting the initial multi-mode large model according to the training result passing the verification, for example, adjusting the architecture, parameters, an optimization algorithm and the like of the model so as to improve the performance of the model on a specific task. Step S550, an evaluation index is introduced to evaluate the adjustment result, and the multi-modal large model is generated according to adjustment evaluation information. The method comprises the steps of introducing a plurality of evaluation indexes to evaluate an adjustment result, finding out a model or configuration which is best in performance on a verification data set, specifically, predicting the verification data set by using a candidate model, generating a prediction result, calculating values of various evaluation indexes according to the prediction result and a real label of the verification data set, comparing evaluation index values of different candidate models to determine which model is best in performance on the verification data set, generating a final multi-mode large model according to the configuration and parameters of the model, directly processing an image of a PDF document, and combining diversified training data to improve the accuracy and robustness of metadata extraction.
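A minimal sketch of the cross-validation and selection logic of steps S530 to S550, assuming scikit-learn's KFold splitter and a hypothetical evaluate_model callable that trains a candidate on the training fold and returns a validation score (neither is part of the claimed method):

```python
# Sketch of steps S530-S550: k-fold cross-validation over candidate training
# results and selection of the best-performing one. evaluate_model is a
# hypothetical callable supplied by the caller.
from sklearn.model_selection import KFold
import numpy as np

def select_best_result(samples, labels, candidates, evaluate_model, n_splits=5):
    samples, labels = np.asarray(samples), np.asarray(labels)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = {name: [] for name in candidates}
    for train_idx, val_idx in kfold.split(samples):
        for name, model in candidates.items():
            # evaluate_model trains on the training fold and scores the validation fold.
            scores[name].append(evaluate_model(
                model, samples[train_idx], labels[train_idx],
                samples[val_idx], labels[val_idx]))
    # The training result with the best mean validation score passes verification.
    return max(scores, key=lambda name: float(np.mean(scores[name])))
```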
In a possible implementation manner, step S550 further includes step S551, setting metadata extraction targets based on the N PDF papers and the N initial metadata. Metadata extraction targets are set according to the N PDF papers and the N initial metadata, and are used for automatically extracting, from the PDF papers, metadata that matches the initial metadata or is richer than it, such as the title, authors, abstract and keywords of each paper. Step S552, extraction training is performed based on the metadata extraction targets, testing is performed according to the training results, and a test passing rate is calculated. The PDF papers are used for extraction training according to the metadata extraction targets: specifically, PDF papers containing the required metadata are collected as a training data set, each metadata field is annotated with a corresponding label, and an information extraction model is constructed and trained with the training data. During training, the perplexity of the model is calculated periodically and the training strategy is adjusted according to its trend; specifically, if the perplexity of the model gradually decreases as training proceeds, the model can be considered to be learning how to effectively extract metadata from PDF papers. Perplexity is an index measuring the performance of a language model and reflects the model's predictive ability on test data; a lower perplexity means the model can better predict the next word or character. PDF papers different from the training data but with a similar distribution and characteristics are prepared as a test data set, the model attempts to extract metadata from the test papers, the extracted metadata is compared with the true labels of the test data, and the test passing rate is calculated, namely the proportion of correctly extracted metadata fields among all metadata fields, which intuitively represents the overall performance of the model on the test data.
Step S553, setting extraction quality scores according to the test passing rate, and making the evaluation index according to the quality scores. Setting extraction quality scores according to a specific numerical range of the test passing rate, for example, setting a passing rate threshold, giving a higher quality score when the test passing rate exceeds the threshold, giving a lower quality score when the test passing rate is lower than the threshold, and formulating detailed evaluation standards, namely evaluation indexes (such as accuracy, the ratio of the number of correctly extracted metadata to the total extraction number) based on quality score results (quality score values) so as to judge whether the extraction quality of the metadata extraction model meets specific quality requirements. Step S554, determining whether the adjustment result has adjustment data smaller than the evaluation index according to the evaluation index. And using the evaluation index to evaluate the adjustment result of the model, and judging whether adjustment data smaller than the evaluation index exist in the adjustment result.
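A non-limiting sketch of the pass-rate, quality-score and evaluation-index logic of steps S552 to S555; the score bands and threshold values are assumptions chosen for illustration:

```python
# Sketch of steps S552-S555: test passing rate, extraction quality score and
# flagging of adjustment data below the evaluation index. Thresholds are assumptions.
def test_pass_rate(extracted, ground_truth):
    # Fraction of metadata fields whose extracted value matches the true label.
    hits = sum(1 for field, value in ground_truth.items() if extracted.get(field) == value)
    return hits / max(len(ground_truth), 1)

def quality_score(pass_rate, threshold=0.9):
    # Higher score when the passing rate exceeds the threshold, lower otherwise.
    return 1.0 if pass_rate >= threshold else pass_rate

def flag_below_index(adjustment_results, evaluation_index=0.85):
    # Adjustment data smaller than the evaluation index goes to the to-be-verified set.
    return [r for r in adjustment_results if r["score"] < evaluation_index]

# to_verify = flag_below_index([{"id": 1, "score": 0.80}, {"id": 2, "score": 0.95}])
```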
Step S555, if the adjustment data smaller than the evaluation index exists in the adjustment result, marking and recording the adjustment data smaller than the evaluation index, and adding the adjustment data to the data set to be verified. If there is said adjustment data smaller than the evaluation index in the adjustment result, annotation records are made for adjustment data that do not meet the evaluation index and they are added to the data set to be verified, which will be used for subsequent verification and possible further adjustment. Step S556, screening the adjustment result based on the data set to be verified, and generating the adjustment evaluation information. Using the data set to be validated to further validate and adjust the performance of the model, it may be found that certain adjustment data are valid in certain situations, based on which the adjustment results are screened and adjustment evaluation information is ultimately generated.
Step S600, the first initial metadata is verified by using the multi-modal large model in combination with the first PDF paper, and the first metadata of the first PDF paper is output. The first PDF paper is input to the trained multi-modal large model for metadata extraction, and the first initial metadata is then verified against the metadata extraction result, i.e., the consistency between the metadata extracted from the PDF paper by the model and the initial metadata is compared, and the accuracy and integrity of the initial metadata are checked; after the verification passes, the initial metadata is output as the first metadata corresponding to the first PDF paper.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.