
CN119005168B - A structured analysis method for PDF paper metadata based on a multimodal large model - Google Patents


Info

Publication number
CN119005168B
CN119005168B
Authority
CN
China
Prior art keywords
pdf
metadata
visual
text
extraction
Prior art date
Legal status
Active
Application number
CN202411205429.4A
Other languages
Chinese (zh)
Other versions
CN119005168A (en)
Inventor
胡懋地
宋东桓
钱力
常志军
Current Assignee
National Science Library Chinese Academy Of Sciences
Original Assignee
National Science Library Chinese Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by National Science Library Chinese Academy Of Sciences filed Critical National Science Library Chinese Academy Of Sciences
Priority to CN202411205429.4A
Publication of CN119005168A
Application granted
Publication of CN119005168B
Legal status: Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/205 — Parsing (natural language analysis)
    • G06F 18/253 — Fusion techniques of extracted features (pattern recognition)
    • G06F 40/151 — Transformation (use of codes for handling textual entities)
    • G06F 40/30 — Semantic analysis
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention discloses a structured analysis method for PDF paper metadata based on a multimodal large model, and relates to the technical field related to academic paper management. The method comprises: collecting a first PDF paper for visual analysis, generating metadata extraction prompts; constructing an initial multimodal large model, and outputting first initial metadata; defining multiple data texts to generate a metadata extraction prompt set; outputting N initial metadata of N PDF papers; generating a multimodal large model; verifying the first initial metadata, and outputting the first metadata of the first PDF paper. The method solves the technical problem that the key and detailed information of the document is lost during the OCR text conversion process in the existing PDF paper metadata analysis, and all the texts in the PDF paper cannot be accurately identified, which leads to poor accuracy and integrity of metadata extraction and identification, and achieves the technical effect of improving the accuracy and integrity of PDF paper metadata extraction results.

Description

Structured analysis method for PDF paper metadata based on a multimodal large model
Technical Field
The present application relates to the technical field of academic paper management, and in particular to a structured analysis method for PDF paper metadata based on a multimodal large model.
Background
With the growth of scientific research activity, the number of academic papers has increased explosively, and the PDF format serves as the main carrier of academic documents. Metadata extraction from PDF papers is an important basis for information integration and knowledge management. Existing PDF paper metadata extraction relies mainly on optical character recognition (OCR) technology: OCR scans the characters in a page image and converts them into an editable text format, and metadata extraction is then carried out on the text result. However, as academic documents become more diverse and complex, the limitations of OCR-based metadata extraction are increasingly prominent. First, converting a PDF document into plain text often discards visual characteristics such as font size, character arrangement and colour depth, which are very important for accurately recognising key data such as the paper title, authors and abstract. Second, for PDF documents containing complex layouts, tables, charts and multi-column text, misreading and missed recognition often occur, leading to erroneous metadata extraction results. In addition, in practical applications, scanned PDF documents of poor quality further degrade OCR accuracy and cause additional extraction errors.
Therefore, in the current related technology of PDF paper metadata analysis, key and detail information of the document is lost during OCR text conversion and not all text in the PDF paper can be accurately identified, resulting in poor accuracy and completeness of metadata extraction.
Disclosure of Invention
The present application solves the technical problem in the prior art that key and detail information of a document is lost during OCR text conversion and that not all text in a PDF paper can be accurately identified, which results in poor accuracy and completeness of metadata extraction, and achieves the technical effect of improving the accuracy and completeness of PDF paper metadata extraction results.
The present application provides a structured analysis method for PDF paper metadata based on a multimodal large model, comprising: collecting a first PDF paper and performing visual analysis to generate first PDF image information; identifying the first PDF image information according to text extraction requirements to generate a metadata extraction prompt; constructing an initial multimodal large model, synchronising the first PDF image information and the metadata extraction prompt to the initial multimodal large model, and outputting first initial metadata of the first PDF paper; collecting N PDF papers based on a plurality of information sources and performing visual analysis to generate N pieces of PDF image information, where N is an integer greater than 1; defining a plurality of data texts according to the text extraction requirements to generate a metadata extraction prompt set; synchronising the N pieces of PDF image information and the metadata extraction prompt set to the initial multimodal large model, and outputting N pieces of initial metadata of the N PDF papers; adjusting the initial multimodal large model based on the N PDF papers and the N pieces of initial metadata to generate a multimodal large model; and verifying the first initial metadata by means of the multimodal large model, and outputting the first metadata of the first PDF paper.
In a possible implementation, before the first PDF paper is collected, the following processing is further performed: traversing the plurality of information sources and performing random selection to generate a plurality of PDF documents; performing virtual generation according to the plurality of PDF documents to generate a plurality of pieces of virtual metadata and a plurality of first-page texts, where the virtual metadata and first-page texts correspond to the PDF documents; aligning the plurality of pieces of virtual metadata and the plurality of first-page texts according to the plurality of PDF documents; and traversing the alignment results and performing random extraction to generate the first PDF paper.
In a possible implementation, collecting the first PDF paper for visual analysis to generate the first PDF image information and identifying the first PDF image information according to the text extraction requirements to generate the metadata extraction prompt comprises: performing page conversion on the first document page of the first PDF paper to generate a first document page image; performing image standardisation on the first document page image to generate the first PDF image information; performing feature extraction on the first PDF image information by means of computer vision technology to obtain the text extraction requirements; traversing the first PDF image information according to the text extraction requirements to generate labels to be extracted; and determining the metadata extraction prompt according to the labels to be extracted.
In a possible implementation, constructing the initial multimodal large model comprises: constructing a visual encoder architecture using a convolutional neural network, training the visual encoder architecture in combination with the first PDF image information, and determining a visual encoder; preprocessing based on the metadata extraction prompt to generate a word embedding sequence; constructing a language encoder architecture using a recurrent neural network, training the language encoder architecture in combination with the word embedding sequence, and determining a language encoder; cooperatively connecting the visual encoder and the language encoder to construct a model component; and invoking LLaVA architecture processing in combination with the model component to construct the initial multimodal large model.
In a possible implementation, synchronising the first PDF image information and the metadata extraction prompt to the initial multimodal large model and outputting the first initial metadata of the first PDF paper comprises: inputting the first PDF image information into the visual encoder for feature extraction to generate a visual feature vector; inputting the metadata extraction prompt into the language encoder for feature extraction to generate a text feature vector; jointly encoding the visual feature vector and the text feature vector to generate a fusion feature; and performing decoding based on the fusion feature to generate the first initial metadata.
In a possible implementation, before the visual feature vector and the text feature vector are jointly encoded, the following processing is further performed: aligning the visual feature vector and the text feature vector to generate an image-text pair; introducing a contrastive loss function to compute over the image-text pair and determine vector distance data; judging whether the vector distance data falls within a vector distance threshold interval, and if so, generating a matching signal; and comparing the visual feature vector and the text feature vector according to the matching signal, enhancing the matching signal when they are consistent and reducing the matching signal when they are inconsistent.
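The alignment check described above can be sketched as follows in pure Python. The cosine-based distance and the threshold interval values are illustrative assumptions; the patent does not specify the distance metric or threshold.

```python
import math

# Sketch of the image-text alignment check: compute a distance between a
# visual feature vector and a text feature vector, then compare it against
# a threshold interval to emit a match signal. Threshold values are
# illustrative assumptions, not values from the patent.

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def match_signal(visual_vec, text_vec, low=0.0, high=0.3):
    d = cosine_distance(visual_vec, text_vec)
    return low <= d <= high  # inside the distance threshold interval -> match

visual_vec = [0.9, 0.1, 0.4]
text_vec = [0.8, 0.2, 0.5]
matched = match_signal(visual_vec, text_vec)
```

In a trained system the two vectors would come from the visual and language encoders, and the contrastive loss would pull matching pairs together while pushing mismatched pairs apart, which is what makes a fixed threshold interval meaningful.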
In a possible implementation, jointly encoding the visual feature vector and the text feature vector to generate the fusion feature further comprises: dividing a plurality of visual feature sub-regions based on the visual feature vector, the visual feature sub-regions comprising visual tokens; performing attention computation based on the visual tokens to generate a plurality of visual attention weights; dividing a plurality of text feature sub-regions based on the text feature vector, the text feature sub-regions comprising text tokens; performing attention computation based on the text tokens to generate a plurality of text attention weights; constructing, based on the matching signal, a plurality of association relations between the visual feature vector and the text feature vector according to the visual attention weights and the text attention weights; and jointly encoding the visual feature vector and the text feature vector based on the association relations to generate the fusion feature.
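The token-level attention and fusion above can be illustrated with a minimal sketch. The token values, attention scores, and concatenation-as-fusion are toy assumptions standing in for learned attention and the patent's unspecified joint encoding.

```python
import math

# Sketch of the joint-encoding step: softmax attention weights over visual
# and text tokens, then a simple fusion of the two attended contexts.
# All values are illustrative; a real model learns scores and fusion.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens, scores):
    """Weighted sum of token vectors under softmax-normalised scores."""
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

visual_tokens = [[1.0, 0.0], [0.0, 1.0]]  # token sub-regions of the page image
text_tokens = [[0.5, 0.5], [1.0, 1.0]]    # token pieces of the prompt
visual_ctx = attend(visual_tokens, [2.0, 1.0])
text_ctx = attend(text_tokens, [1.0, 1.0])
fusion = visual_ctx + text_ctx            # concatenation as a simple fusion
```

Concatenating the two attended contexts is one simple choice; cross-attention between the modalities, as in LLaVA-style architectures, is the more common learned alternative.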
In a possible implementation, adjusting the initial multimodal large model based on the N PDF papers and the N pieces of initial metadata to generate the multimodal large model further comprises: performing data enhancement on the N PDF papers and the N pieces of initial metadata to construct a training data set and a verification data set; training with the training data set in a metadata extraction manner to generate a plurality of training results; cross-verifying the plurality of training results with the verification data set; extracting the training results that pass verification to dynamically adjust the initial multimodal large model and generate an adjustment result; and introducing an evaluation index to evaluate the adjustment result and generating the multimodal large model according to the adjustment evaluation information.
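The data handling in this fine-tuning step can be sketched as below. The augmentation is a trivial placeholder and the split ratio and fold count are illustrative assumptions.

```python
import random

# Sketch of the fine-tuning data handling: augment the N (paper, metadata)
# pairs, split them into a training set and a verification set, and build
# cross-validation folds. Augmentation here is a placeholder; a real
# pipeline would perturb page images or prompts.

random.seed(0)

pairs = [(f"paper_{i}.pdf", {"title": f"Title {i}"}) for i in range(20)]

def augment(pairs):
    # placeholder augmentation: duplicate each pair with a marker flag
    return pairs + [(p, {**m, "augmented": True}) for p, m in pairs]

data = augment(pairs)
random.shuffle(data)
split = int(0.8 * len(data))           # illustrative 80/20 split
train_set, val_set = data[:split], data[split:]

def k_fold(dataset, k=4):
    """Yield (train, held_out) folds for cross-validation."""
    size = len(dataset) // k
    for i in range(k):
        held = dataset[i * size:(i + 1) * size]
        rest = dataset[:i * size] + dataset[(i + 1) * size:]
        yield rest, held
```

Each fold's held-out portion plays the role of the verification data set: training results that fail on it would be excluded before the model is dynamically adjusted.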
In a possible implementation, introducing an evaluation index to evaluate the adjustment result and generating the multimodal large model according to the adjustment evaluation information further comprises: setting metadata extraction targets based on the N PDF papers and the N pieces of initial metadata; performing extraction training based on the metadata extraction targets, testing according to the training results, and calculating a test pass rate; setting an extraction quality score according to the test pass rate and formulating the evaluation index according to the quality score; judging, according to the evaluation index, whether adjustment data smaller than the evaluation index exist in the adjustment result, and if so, marking and recording the adjustment data smaller than the evaluation index and adding them to a data set to be verified; and screening the adjustment result based on the data set to be verified to generate the adjustment evaluation information.
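The pass-rate-based evaluation can be sketched as follows; the per-field test results and the threshold value are toy assumptions illustrating the screening logic.

```python
# Sketch of the evaluation step: compute a test pass rate per metadata
# field, treat it as the quality score, and flag adjustment results that
# fall below the evaluation index for re-verification. All numbers are
# illustrative, not values from the patent.

def pass_rate(results):
    return sum(results) / len(results)

def evaluate(adjustments, evaluation_index):
    """Split adjustment results into accepted and to-be-verified sets."""
    accepted, to_verify = [], []
    for name, score in adjustments.items():
        (accepted if score >= evaluation_index else to_verify).append(name)
    return accepted, to_verify

# 1 = extraction test passed, 0 = failed, per metadata field
field_tests = {"title": [1, 1, 1, 0], "authors": [1, 1, 1, 1], "abstract": [1, 0, 0, 1]}
scores = {field: pass_rate(r) for field, r in field_tests.items()}
evaluation_index = 0.75  # illustrative threshold derived from quality scores
accepted, to_verify = evaluate(scores, evaluation_index)
```

Fields landing in `to_verify` correspond to the marked adjustment data added to the data set to be verified before the final model is generated.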
According to the structured analysis method for PDF paper metadata based on a multimodal large model provided by the present application, visual analysis is first performed on the first PDF paper to generate the first PDF image information, and the first PDF image information is identified according to the text extraction requirements to generate the metadata extraction prompt; an initial multimodal large model is then constructed, the first PDF image information and the metadata extraction prompt are synchronised to the initial multimodal large model, and the first initial metadata of the first PDF paper is output; visual analysis is performed on N PDF papers based on a plurality of information sources to generate N pieces of PDF image information, and a plurality of data texts are defined according to the text extraction requirements to generate a metadata extraction prompt set; the N pieces of PDF image information and the metadata extraction prompt set are synchronised to the initial multimodal large model, and N pieces of initial metadata of the N PDF papers are output; the initial multimodal large model is adjusted based on the N PDF papers and the N pieces of initial metadata; finally, the first initial metadata is verified in combination with the multimodal large model, and the first metadata of the first PDF paper is output. This solves the technical problem that key and detail information of the document is lost during OCR text conversion and that not all text in the PDF paper can be accurately identified, which leads to poor accuracy and completeness of metadata extraction, and achieves the technical effect of improving the accuracy and completeness of PDF paper metadata extraction results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments are briefly described below; flowcharts are used to illustrate operations performed by a system according to embodiments of the present disclosure. It should be understood that the preceding or following operations are not necessarily performed precisely in order. Rather, the various steps may be processed in reverse order or simultaneously as needed, and other operations may be added to or removed from these processes.
Fig. 1 is a schematic flow chart of a structured analysis method for PDF paper metadata based on a multimodal large model according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of generating metadata extraction prompts in the multimodal large model-based structured analysis method for PDF paper metadata according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of contrastive learning in the multimodal large model-based structured analysis method for PDF paper metadata according to an embodiment of the present application.
Detailed Description
The foregoing is only an overview of the technical solution of the present application. In order that the technical means of the present application may be more clearly understood and implemented in accordance with the content of the specification, and in order to make the above and other objects, features and advantages of the present application more apparent, specific embodiments are set forth below.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it is to be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments and may be combined with each other where there is no conflict. The term "first\second" merely distinguishes similar objects and does not represent a particular ordering of the objects. The terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article or apparatus. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The terminology used herein is for the purpose of describing embodiments of the application only.
The embodiment of the present application provides a structured analysis method for PDF paper metadata based on a multimodal large model, as shown in Fig. 1, comprising the following steps:
Step S100, a first PDF paper is collected for visual analysis, first PDF image information is generated, the first PDF image information is identified according to text extraction requirements, and a metadata extraction prompt is generated. A target PDF paper is collected, or a PDF paper containing multiple elements such as text, figures, tables and formulas is randomly obtained, as the first PDF paper. Visual analysis is then performed on the first PDF paper using computer technology; specifically, the text, figures, tables and so on of the first page of the first PDF paper are recognised and analysed to generate a first PDF image, i.e. the first-page image of the first PDF paper. The first-page image is then identified according to the text extraction requirements, which include extracting detailed information such as the paper title, authors, abstract, keywords, the font size of the title, and the relative position and layout of the text. That is, the corresponding text in the first PDF image information is marked according to the specific text extraction requirements, for example by marking the text regions to be extracted with annotation tools (such as rectangular boxes or polygons). Finally, a metadata extraction prompt is generated. Metadata is structured information describing data and is generally used to provide additional information about the data; in academic papers, metadata includes the title, authors, abstract, keywords and so on. Generating metadata extraction prompts, which help the model extract the corresponding data and detailed information from PDF papers more efficiently, means automatically generating or suggesting which metadata should be extracted based on the identified text regions.
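The prompt-generation step above can be sketched as follows. The region labels and prompt wording are illustrative assumptions; the patent does not specify an exact prompt format.

```python
# Sketch of Step S100: turn labeled text regions of the first-page image
# into a metadata extraction prompt. Region names, bounding boxes, and
# prompt wording are illustrative assumptions, not the patent's format.

def build_extraction_prompt(labeled_regions):
    """labeled_regions: list of dicts with 'label' and 'bbox' (x0, y0, x1, y1)."""
    lines = ["Extract the following metadata fields from the page image:"]
    for region in labeled_regions:
        x0, y0, x1, y1 = region["bbox"]
        lines.append(f"- {region['label']}: region at ({x0},{y0})-({x1},{y1})")
    return "\n".join(lines)

regions = [
    {"label": "title", "bbox": (50, 40, 550, 90)},
    {"label": "authors", "bbox": (50, 100, 550, 130)},
    {"label": "abstract", "bbox": (50, 160, 550, 400)},
]
prompt = build_extraction_prompt(regions)
```

The resulting prompt string is what would later be synchronised, together with the page image, to the multimodal model.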
In a possible implementation, step S100 further includes step S101: traversing the plurality of information sources and performing random selection to generate a plurality of PDF documents; performing virtual generation according to the plurality of PDF documents to generate a plurality of pieces of virtual metadata and a plurality of first-page texts, where the virtual metadata and first-page texts correspond to the PDF documents; aligning the plurality of pieces of virtual metadata and the plurality of first-page texts according to the plurality of PDF documents; and traversing the alignment results and performing random extraction to generate the first PDF paper.
Preferably, the first PDF paper is generated from massive PDF data and metadata synthesised from a small number of publisher layout templates. That is, a small number of publishers are determined from the plurality of information sources by random selection, and their layout templates are retrieved; PDF documents and data are synthesised from the retrieved templates, and virtual generation is carried out for each PDF document. The PDF documents may contain various academic papers, reports and the like; in other words, a large number of PDFs in a small number of layouts, together with their corresponding metadata, are simulated at low cost by a large model such as GPT-4, generating virtual metadata and first-page texts for a plurality of PDF documents. The virtual metadata include the title, authors, abstract, keywords, publication information and so on of the paper, and each PDF document corresponds uniquely to one piece of virtual metadata and one first-page text. The plurality of pieces of virtual metadata and first-page texts are then aligned according to the plurality of PDF documents, with the corresponding virtual texts rendered using the layout templates provided by the publishers, yielding PDF papers as the alignment results. Finally, a PDF paper is randomly extracted from these papers during traversal to serve as the first PDF paper.
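Step S101 can be sketched as below. The template names and metadata contents are invented placeholders; in the patent this synthesis is delegated to a large model such as GPT-4, which is not callable in a self-contained sketch.

```python
import random

# Sketch of Step S101: pick publisher layout templates at random, generate
# virtual (metadata, first-page text) pairs per document, keep them
# aligned by document, and randomly draw one pair as the first PDF paper.
# Template names and metadata values are invented placeholders.

random.seed(42)

TEMPLATES = ["template_A", "template_B", "template_C"]  # hypothetical layouts

def make_virtual_paper(doc_id, template):
    metadata = {
        "title": f"Paper {doc_id}",
        "authors": [f"Author {doc_id}"],
        "template": template,
    }
    # first-page text rendered from the metadata keeps the pair aligned
    first_page = f"{metadata['title']}\n{', '.join(metadata['authors'])}\nAbstract: ..."
    return {"doc_id": doc_id, "metadata": metadata, "first_page": first_page}

corpus = [make_virtual_paper(i, random.choice(TEMPLATES)) for i in range(100)]
first_pdf_paper = random.choice(corpus)
```

Because each first-page text is rendered directly from its metadata, every document in the corpus is an aligned (metadata, text) pair, which is the property the alignment step is meant to guarantee.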
In a possible implementation, as shown in Fig. 2, step S100 further includes step S110: performing page conversion on the first document page of the first PDF paper to generate a first document page image. Performing page conversion on the first page of the first PDF paper means converting the first page in PDF format into an image format (such as JPEG or PNG) to generate the document first-page image. Step S120: performing image standardisation on the document first-page image to generate the first PDF image information. Image standardisation typically includes adjusting the size, resolution, colour space and so on of the first-page image to ensure consistency and comparability of the images, yielding the first PDF image. Step S130: performing feature extraction on the first PDF image information using computer vision technology, determining a plurality of image features, and acquiring the text extraction requirements. Computer vision techniques (such as deep learning and image processing algorithms) are used to identify and analyse key features in the image, which may include font-size features (for example, the fonts of key information such as the title, subtitle, author names and abstract often differ from the body text), relative-position features (for example, the author names may be below the title, and the abstract below the author names at a certain vertical distance from both), and layout features (such as paragraph distribution, column layout, table positions and image embedding). By analysing these image features, it can be inferred which parts of the text are the focus of metadata extraction, i.e. the text extraction requirements are obtained: for example, the title is typically at the top of the page in a larger font, while the author names and affiliations are below the title in a smaller font at a fixed position.
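The layout heuristics described in step S130 can be sketched as follows. The thresholds (top third of the page, font-size comparison) are illustrative assumptions, not rules from the patent.

```python
# Sketch of Step S130: infer metadata candidates from visual features
# (font size and vertical position) of recognised text blocks. The
# heuristic thresholds are illustrative assumptions.

def infer_labels(blocks, page_height):
    """blocks: list of dicts with 'text', 'font_size', 'y' (top coordinate)."""
    # Title heuristic: largest font in the top third of the page.
    top = [b for b in blocks if b["y"] < page_height / 3]
    title = max(top, key=lambda b: b["font_size"]) if top else None
    labels = {}
    if title:
        labels["title"] = title["text"]
        # Authors heuristic: first smaller-font block directly below the title.
        below = [b for b in blocks
                 if b["y"] > title["y"] and b["font_size"] < title["font_size"]]
        if below:
            labels["authors"] = min(below, key=lambda b: b["y"])["text"]
    return labels

blocks = [
    {"text": "A Multimodal Parsing Method", "font_size": 18, "y": 40},
    {"text": "Alice Zhang, Bob Li", "font_size": 10, "y": 95},
    {"text": "Abstract: We propose ...", "font_size": 9, "y": 150},
]
labels = infer_labels(blocks, page_height=800)
```

In the described method, heuristics like these only seed the text extraction requirements; the multimodal model, not the heuristics, produces the final metadata.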
Step S140: the first PDF image information is traversed according to the text extraction requirements for marking, and labels to be extracted are generated. Text content in the PDF image is converted into a processable format (such as text strings or text blocks) using image processing techniques, and the first PDF image information is traversed according to the text extraction requirements to find and identify potential text information, generating the labels to be extracted, which determine which text is extracted, for example labels such as "title", "authors" and "abstract". Step S150: the metadata extraction prompt is determined according to the labels to be extracted. The generated labels to be extracted are used to determine the metadata extraction prompt, which indicates how to effectively extract the corresponding metadata from the original document (such as a PDF file). Specifically, a metadata extraction strategy is formulated, comprising: determining the specific positions of the metadata in the document using text-position features (such as relative font position); identifying and extracting the metadata by text-content matching (such as keyword search and regular-expression matching); and, for structured documents (such as PDFs with HTML tags), extracting the metadata by analysing the document structure. The metadata extraction prompt is generated and determined according to this strategy and includes position prompts (the page positions or text regions where the metadata may appear), format prompts (the size, format and so on of the characters) and content prompts (keywords or phrases the metadata may contain).
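The keyword-search and regular-expression matching mentioned in step S150 can be sketched as below. The specific patterns are illustrative assumptions; a real system would derive them from the publisher's layout templates.

```python
import re

# Sketch of the text-content matching in Steps S140/S150: keyword search
# and regular-expression matching over recognised first-page text.
# The patterns below are illustrative assumptions.

PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/\S+\b"),
    "keywords": re.compile(r"Keywords?:\s*(.+)", re.IGNORECASE),
}

def match_metadata(text):
    found = {}
    doi = PATTERNS["doi"].search(text)
    if doi:
        found["doi"] = doi.group(0)
    kw = PATTERNS["keywords"].search(text)
    if kw:
        found["keywords"] = [k.strip() for k in kw.group(1).split(",")]
    return found

page_text = "... DOI: 10.1234/abcd.5678\nKeywords: PDF parsing, multimodal model"
meta = match_metadata(page_text)
```

In the described method such matches would serve as content prompts handed to the model, rather than as the final extraction result.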
Step S200, an initial multimodal large model is constructed, the first PDF image information and the metadata extraction prompt are synchronised to the initial multimodal large model, and the first initial metadata of the first PDF paper is output. A visual encoder architecture is constructed based on a convolutional neural network; specifically, the visual encoder architecture is trained with the first PDF image information as a training set, learning in the course of training how to automatically extract useful visual information from PDF images, thereby obtaining a visual encoder. A language encoder architecture is likewise constructed and trained with the metadata extraction prompts as training data, learning how to extract useful language information from the metadata extraction prompt sequence, thereby obtaining a language encoder. The constructed visual encoder, language encoder and the connection mechanism between them are then integrated using the LLaVA architecture, thereby constructing the initial multimodal large model. By integrating information of different modalities, the model can process data of multiple modalities (such as images and text) simultaneously and achieve more comprehensive understanding and processing capability. For PDF metadata extraction, the multimodal large model processes the document image and the metadata extraction prompt simultaneously and outputs the metadata; specifically, the first PDF image information and the metadata extraction prompt are input into the initial multimodal large model, which outputs the first initial metadata of the first PDF paper, such as the title, authors, abstract and keywords.
In a possible implementation manner, step S200 further includes step S210, constructing a visual encoder architecture using a convolutional neural network, and training the visual encoder architecture in combination with the first PDF image information to determine a visual encoder. Convolutional Neural Networks (CNNs) are used to construct a visual encoder capable of processing image data, typically comprising only convolutional and pooling layers (or convolutional layers with a step size greater than 1) in a feature extraction network, which is trained using a first PDF image as training data, which learns how to identify and extract visual features useful for metadata extraction for extracting advanced features of PDF image information (which may have been converted to an image format suitable for processing), such as font size, typesetting layout, etc. And step S220, preprocessing is performed based on the metadata extraction prompt, and a word embedding sequence is generated. Preprocessing the text data according to the metadata extraction prompt may include text cleaning, word segmentation, removal of stop words, etc., and then converting the text into vector representations, i.e., word embedding sequences, using Word embedding techniques (e.g., word2Vec, gloVe or BERT embedding layers) to capture semantic relationships between words.
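A minimal, pure-Python sketch of the convolution and pooling operations such a feature-extraction network applies; a real visual encoder would use learned kernels in a deep-learning framework, so these functions are illustrative only:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsample while retaining main features."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out
```

Stacking such layers extracts progressively higher-level features (edges, then layout elements such as font size and typesetting) from the page image.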
Step S230: construct a language encoder architecture using a recurrent neural network, and train it with the word embedding sequence to determine a language encoder. A recurrent neural network is used to build an encoder that receives word embedding sequences as input and outputs encoded language features. Word embedding captures semantic relationships between words, so that similar words lie closer together in the vector space. The text data (such as text extracted from PDFs) is preprocessed, e.g., word segmentation and stop-word removal; each word is converted into its vector representation by word embedding, and the vectors are arranged in text order to form a word embedding sequence. The language encoder architecture is trained on these sequences: at each time step it receives one word embedding vector and updates its internal state to capture information in the sequence. After sufficient training iterations, it has learned how to convert word embedding sequences into useful language feature representations, and the language encoder is determined.
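The per-time-step update described above may be sketched as a minimal Elman-style recurrence; the scalar weights `w_in` and `w_rec` are illustrative assumptions standing in for learned parameter matrices:

```python
import math

def rnn_encode(embeddings, w_in=0.5, w_rec=0.3):
    """Minimal Elman-style recurrent encoder: at each time step the hidden
    state is updated from the current word embedding and the previous state;
    the final state summarizes the whole sequence. Scalar weights are used
    here for clarity instead of learned weight matrices."""
    h = [0.0] * len(embeddings[0])  # initial hidden state
    for x in embeddings:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
    return h
```

The returned vector plays the role of the encoded language feature fed into later fusion stages.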
And step S240, the visual encoder and the language encoder are cooperatively connected to construct a model component. The visual encoder for processing visual information and the language encoder for processing language information are combined through a specific mechanism to realize interaction and fusion of the visual information and the language information, for example, a double-flow attention mechanism is adopted by a VisualBERT model, images and texts are respectively input into the visual encoder and the language encoder and are interacted through a multi-layer attention mechanism, the images and the texts are subjected to preliminary processing in the respective encoders and then are interacted through a cross-modal attention layer, specifically, the input images and texts are respectively subjected to embedded representation, for the images, a pre-trained convolutional neural network is generally used for extracting features and converting the features into embedded vectors, for the texts, word embedding and position coding are used for representing text sequences, then the embedded representation is respectively input into the visual encoder and the language encoder, preliminary feature extraction and coding are carried out, and interaction and information transmission between the images and the texts are realized through the cross-modal attention mechanism.
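A simplified sketch of the cross-modal attention interaction described above, with text tokens as queries attending over visual tokens; real models use learned projections and multiple heads, so this single-head, projection-free version is illustrative only:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attention(text_tokens, visual_tokens):
    """Each text token (query) attends over the visual tokens (keys/values),
    producing a text representation enriched with visual information."""
    out = []
    for q in text_tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in visual_tokens]
        weights = softmax(scores)
        attended = [sum(w * v[d] for w, v in zip(weights, visual_tokens))
                    for d in range(len(visual_tokens[0]))]
        out.append(attended)
    return out
```

Running the symmetric direction (visual queries over text tokens) gives the two-way interaction of a dual-stream design.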
Step S250, invoking LLaVA architecture processing, and constructing the initial multi-mode large model by combining the model components. LLaVA is an open-source multi-modal large model designed to connect a visual encoder and a Large Language Model (LLM) together to achieve general visual and language understanding, to be able to process visual information (images) of PDF documents and to effectively perform accurate generation of metadata text, specifically, to integrate the visual encoder and the language encoder together through a specific connection mechanism to form a unified multi-modal model, to perform end-to-end training on the model using multi-modal instruction following dataset (e.g., dataset containing images and corresponding text descriptions), during which the model learns how to align and fuse visual features and language features to better accomplish multi-modal tasks, and finally to obtain the initial multi-modal large model.
In a possible implementation manner, step S200 further includes step S260: the first PDF image information is input to the visual encoder for feature extraction, generating a visual feature vector. The visual encoder extracts features from the image through a multi-layer convolutional neural network: convolutional layers extract local features layer by layer, such as edges, corner points, and textures; pooling layers downsample the feature maps, reducing dimensionality while retaining the main features; and a fully connected layer flattens the pooled feature maps and maps them to a high-dimensional feature space, generating a visual feature vector, a point in that space representing the key information in the image. Step S270: input the metadata extraction prompt into the language encoder for feature extraction, generating a text feature vector. The language encoder performs word segmentation, embedding, and similar steps on the text, applies structures such as the Transformer to capture semantic and contextual information, and finally converts this information into a text feature vector, likewise a point in a high-dimensional space representing the key information in the text.
Step S280: jointly encode the visual feature vector and the text feature vector to generate a fusion feature. Joint encoding typically involves concatenating the two vectors, or fusing them through some form of cross-modal attention mechanism; the purpose is to generate a fused feature that contains the key information of both image and text and relates the two in the vector space. Step S290: perform decoding based on the fusion feature to generate the first initial metadata. The decoder receives the fusion feature as input and applies a series of transformations (e.g., linear layers and softmax) to map it to the final text output, generating the first initial metadata, i.e., structured initial metadata such as author names, title, and abstract.
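The joint-encoding step may be sketched as follows; `fuse` is a hypothetical helper showing the two simplest fusion choices mentioned above (concatenation and elementwise combination), not the claimed implementation:

```python
def fuse(visual_vec, text_vec, mode="concat"):
    """Joint encoding of visual and text feature vectors.
    'concat' stitches the two vectors together; 'sum' assumes equal
    dimensions and combines them elementwise."""
    if mode == "concat":
        return visual_vec + text_vec
    if mode == "sum":
        return [v + t for v, t in zip(visual_vec, text_vec)]
    raise ValueError(f"unknown fusion mode: {mode}")
```

The fused vector is then passed to the decoder, which maps it to the structured metadata text.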
In a possible implementation manner, as shown in fig. 3, step S200 further includes a step S201 of aligning the visual feature vector with the text feature vector to generate an image-text pair, a step S202 of introducing a contrast loss function to calculate the image-text pair to determine vector distance data, a step S203 of determining whether the vector distance data is in a vector distance threshold interval, if so, generating a matching signal, and comparing the visual feature vector with the text feature vector according to the matching signal, and a step S204 of enhancing the matching signal when the visual feature vector is consistent with the text feature vector, and if not, cutting down the matching signal.
Preferably, the visual feature vector is a high-dimensional vector extracted from the image by image processing techniques (such as a convolutional neural network, CNN), and the text feature vector is a vector extracted from the text by natural language processing (NLP) techniques (such as word embedding or BERT). The image and text of each PDF document are aligned to form image-text data pairs, each pair intended to come from the front page of the same PDF document; since this pairing is not guaranteed to be accurate, i.e., it cannot always be determined that the image actually comes from the same document, it is refined by subsequent contrastive training. The contrastive loss function is a loss function commonly used to learn similarity or difference between samples; here it quantifies the similarity between image and text: the smaller the vector distance data, the more similar the image and text are in the feature space. Training pulls the image and text of the same PDF document as close together as possible (small distance) and pushes the image and text of different PDF documents as far apart as possible. A vector distance threshold interval is set; when the computed vector distance data falls within this interval, a matching signal is generated, and the visual feature vector and the text feature vector are compared according to this signal. When the two are consistent, the matching signal is enhanced, i.e., strengthened in the positive direction. When the visual feature vector and the text feature vector do not fall within the set vector distance threshold interval, the matching signal is negative and is cut down, i.e., when the consistency of the visual feature vector and the text feature vector is weak, the matching signal is reduced in the negative direction.
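A sketch of the margin-based contrastive loss described above; the margin value and the `euclidean` distance helper are conventional illustrative choices rather than values specified by the method:

```python
def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def contrastive_loss(distance, is_match, margin=1.0):
    """Margin-based contrastive loss: matching image-text pairs are pulled
    together (loss grows with distance), non-matching pairs are pushed
    apart until their distance exceeds the margin."""
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2
```

Minimizing this loss over many pairs makes same-document image-text pairs close and different-document pairs distant in the feature space.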
In one possible implementation manner, the step S280 further includes a step S285 of dividing a plurality of visual feature subareas based on the visual feature vector, wherein the plurality of visual feature subareas comprise a visual token, a step S286 of performing attention calculation based on the visual token to generate a plurality of visual attention weights, a step S287 of dividing a plurality of text feature subareas based on the text feature vector, wherein the plurality of text feature subareas comprise a text token, a step S288 of performing attention calculation based on the text token to generate a plurality of text attention weights, a step S289 of constructing a plurality of association relations between the visual feature vector and the text feature vector according to the plurality of visual attention weights and the plurality of text attention weights based on a matching signal, and a step S2810 of jointly encoding the visual feature vector and the text feature vector based on the plurality of association relations to generate the fusion feature.
Preferably, the original visual feature vector (usually a high-dimensional, continuous vector) is divided into smaller, more easily processed sub-regions, each containing a set of visual tokens. Specifically, in a model such as a Vision Transformer, visual tokens generally refer to the image blocks into which the image is divided: each block is treated as a token and mapped by linear projection to the input dimension of the Transformer, yielding a series of image-block vectors. If the resolution of an image is H×W and the size of each divided block is P×P, then (H/P)×(W/P), i.e., (H·W)/P², image blocks are obtained. These tokens are the basic units when the model processes the image and carry the local information and features of the image. Attention is computed for each visual token to determine its importance in the whole, i.e., the correlation between each token and the other visual tokens or the overall representation, so that in the subsequent fusion each visual token contributes in proportion to its importance.
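The P×P partition of an H×W image into (H/P)×(W/P) visual tokens can be sketched as follows (single-channel image, no linear projection, for illustration only):

```python
def patchify(image, p):
    """Split an H×W single-channel image into non-overlapping P×P patches
    ('visual tokens'), as in a Vision Transformer; yields (H/P)*(W/P)
    patches, each flattened into a vector of length P*P."""
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patches.append([image[i + di][j + dj]
                            for di in range(p) for dj in range(p)])
    return patches
```

In a full model, each flattened patch would then be linearly projected to the Transformer's input dimension.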
Preferably, similar to the processing of visual feature vectors, the text feature vectors are also divided into a plurality of sub-regions, each sub-region containing a set of text tokens representing different words or phrases in the text, which are basic units of text information, which may be a word, a phrase, a punctuation mark, a subword (subword) or a character, in an NLP task, tokens are the essential basic units of the text, bear semantic and structural information, are the basis for language understanding and generation of a model, and the model continuously optimizes the representation of the tokens by dynamically adjusting parameters during training to realize understanding and generation of the text, and performs attention calculation on each text token to generate a set of text attention weights reflecting the importance of different parts in the whole text representation.
Preferably, based on the obtained matching signal, multiple association relations between the visual feature vector and the text feature vector are built according to the computed visual attention weights and text attention weights. These associations can be direct (such as correspondence based on spatial position), indirect (such as similarity based on semantic content), or a combination of both, and help the model better understand the interaction between visual and textual information during joint encoding. The visual feature vector and the text feature vector are then jointly encoded based on the constructed associations, i.e., the visual and textual information is fused in the feature space, for example by concatenation, weighted summation, or application of an attention mechanism to the feature vectors, generating a fusion feature that is high-dimensional and contains the key features of both the visual and the textual information.
And step S300, acquiring N PDF papers based on a plurality of information sources for visual analysis, generating N PDF image information, defining a plurality of data texts according to the text extraction requirements to generate a metadata extraction prompt set, wherein N is an integer greater than 1. N PDF papers are collected from a plurality of information sources (such as academic databases, library resources, online archives and the like), including a large number of PDF papers of different publishers, then the N PDF papers are subjected to visual analysis to generate N PDF image information corresponding to the N PDF papers (namely front page images of the N papers), then a plurality of data texts of each PDF paper image information are identified according to preset text extraction requirements, including titles, authors, abstracts, keywords and the like of the N papers, so as to obtain a plurality of metadata extraction prompts, and the metadata extraction prompts form a metadata extraction prompt set.
Step S400, synchronizing the N PDF image information and the metadata extraction prompt set to the initial multi-mode large model, and outputting N initial metadata of the N PDF papers, wherein the N PDF papers and the N initial metadata have corresponding relations. And synchronizing the N PDF image information and the metadata extraction prompt set as input data to an initial multi-mode large model, wherein the initial multi-mode large model respectively uses an internal visual encoder and a language encoder to analyze and process the PDF image information and the metadata extraction prompt set, N initial metadata corresponding to N PDF papers are extracted, wherein the N metadata actually refer to N initial metadata sets, because the metadata of each PDF paper itself contains a plurality of text data such as titles, authors, abstracts, keywords and the like of the paper and is not single data, and the N PDF papers and the generated N initial metadata have a one-to-one correspondence.
Step S500: adjust the initial multimodal large model based on the N PDF papers and the N initial metadata to generate the multimodal large model. Data enhancement is performed on the N PDF papers and the corresponding metadata; the enhanced papers serve as a training data set that is input into the initial multimodal large model, and metadata extraction training is performed with it to obtain multiple training results. Data enhancement on the N initial metadata yields a verification data set, which is used to verify the training results and dynamically adjust the initial multimodal large model, for example by adding or modifying a specific processing module, adjusting the extraction strategies for visual and language data, or correcting errors in the extracted metadata. The multimodal large model finally generated can adapt to PDF papers of different formats and styles, better understand and process the document layout of complex PDF papers, and achieve high-precision, efficient metadata extraction.
In a possible implementation manner, step S500 further includes step S510, performing data enhancement on the N PDF papers and the N initial metadata, and constructing a training data set and a verification data set. Data enhancement is a technique for improving model generalization capability by increasing diversity and number of data samples, which may include extracting text content, performing text cleansing, converting format, adding noise, synonym substitution, etc., and N initial metadata, which may include modifying field order, adding or deleting non-critical fields, generating new metadata description, etc., to divide the enhanced data into training data sets and verification data sets. Step S520, training the metadata extraction mode by using the training data set, and generating a plurality of training results. Training of the metadata extraction mode is performed by using the training data set, so that key information in the PDF paper is identified and extracted as metadata, the training process may include a plurality of iterations, and a plurality of training results are generated in the training process.
Step S530, cross-validating the plurality of training results using the validation dataset. The multiple training results are cross-validated using the validation dataset, and the data is partitioned into a training set and a test set by repeating the partitioning multiple times, each time using a different partitioning to train the model and evaluate its performance, helping to reduce the risk of overfitting and providing a more reliable estimate of the model's performance. Step S540, extracting the training result which passes the verification to dynamically adjust the initial multi-mode large model, and generating an adjustment result. And selecting a training result with the best performance from the cross verification as a reference, and dynamically adjusting the initial multi-mode large model according to the training result passing the verification, for example, adjusting the architecture, parameters, an optimization algorithm and the like of the model so as to improve the performance of the model on a specific task. Step S550, an evaluation index is introduced to evaluate the adjustment result, and the multi-modal large model is generated according to adjustment evaluation information. 
The method comprises the steps of introducing a plurality of evaluation indexes to evaluate an adjustment result, finding out a model or configuration which is best in performance on a verification data set, specifically, predicting the verification data set by using a candidate model, generating a prediction result, calculating values of various evaluation indexes according to the prediction result and a real label of the verification data set, comparing evaluation index values of different candidate models to determine which model is best in performance on the verification data set, generating a final multi-mode large model according to the configuration and parameters of the model, directly processing an image of a PDF document, and combining diversified training data to improve the accuracy and robustness of metadata extraction.
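The cross-validation scheme used to select the best-performing configuration may be sketched as a standard k-fold split; `k_fold_splits` is an illustrative helper, not part of the claimed method:

```python
def k_fold_splits(items, k):
    """Partition a dataset into k folds; each fold serves once as the
    validation set while the remaining folds form the training set.
    Returns a list of (train, validation) pairs."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    splits = []
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, val))
    return splits
```

Training and evaluating once per split, then averaging the evaluation indexes, reduces the risk of overfitting to a single partition.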
In a possible implementation manner, step S550 further includes step S551: setting metadata extraction targets based on the N PDF papers and the N initial metadata. The metadata extraction targets are used to automatically extract from the PDF papers metadata that matches, or is richer than, the initial metadata, such as the titles, authors, abstracts, and keywords of the papers. Step S552: perform extraction training based on the metadata extraction targets, test according to the training results, and calculate the test pass rate. Specifically, PDF papers containing the required metadata are collected as a training data set, and each metadata field is labeled with a corresponding label; an information extraction model is then constructed and trained with this data. During training, the perplexity of the model is calculated regularly and the training strategy is adjusted according to its trend: if the perplexity gradually decreases, the model can be considered to be learning how to effectively extract metadata from PDF papers. Perplexity is an index measuring the performance of a language model and reflects its predictive ability on test data; a lower perplexity means the model predicts the next word or character better. PDF papers different from, but similarly distributed to, the training data are prepared as a test data set; the model extracts metadata from the test papers, the extraction is compared with the true labels of the test data, and the test pass rate is calculated, i.e., the ratio of correctly extracted metadata fields to all metadata fields in the test data, which intuitively represents the extraction performance on the test data as a whole.
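The two quantities discussed above, test pass rate and perplexity, may be computed as follows; these are illustrative helpers, and `token_probs` stands for the model's predicted probability of each ground-truth token:

```python
import math

def pass_rate(extracted, gold):
    """Fraction of metadata fields whose extracted value matches the label."""
    assert len(extracted) == len(gold) and gold, "need equal, non-empty lists"
    hits = sum(1 for e, g in zip(extracted, gold) if e == g)
    return hits / len(gold)

def perplexity(token_probs):
    """Perplexity over a sequence: the exponential of the mean negative
    log-probability; lower means the model predicts the next token better."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

For example, a model assigning probability 0.25 to every ground-truth token has perplexity 4, equivalent to guessing uniformly among four tokens.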
Step S553, setting extraction quality scores according to the test passing rate, and making the evaluation index according to the quality scores. Setting extraction quality scores according to a specific numerical range of the test passing rate, for example, setting a passing rate threshold, giving a higher quality score when the test passing rate exceeds the threshold, giving a lower quality score when the test passing rate is lower than the threshold, and formulating detailed evaluation standards, namely evaluation indexes (such as accuracy, the ratio of the number of correctly extracted metadata to the total extraction number) based on quality score results (quality score values) so as to judge whether the extraction quality of the metadata extraction model meets specific quality requirements. Step S554, determining whether the adjustment result has adjustment data smaller than the evaluation index according to the evaluation index. And using the evaluation index to evaluate the adjustment result of the model, and judging whether adjustment data smaller than the evaluation index exist in the adjustment result.
Step S555, if the adjustment data smaller than the evaluation index exists in the adjustment result, marking and recording the adjustment data smaller than the evaluation index, and adding the adjustment data to the data set to be verified. If there is said adjustment data smaller than the evaluation index in the adjustment result, annotation records are made for adjustment data that do not meet the evaluation index and they are added to the data set to be verified, which will be used for subsequent verification and possible further adjustment. Step S556, screening the adjustment result based on the data set to be verified, and generating the adjustment evaluation information. Using the data set to be validated to further validate and adjust the performance of the model, it may be found that certain adjustment data are valid in certain situations, based on which the adjustment results are screened and adjustment evaluation information is ultimately generated.
Step S600: verify the first initial metadata using the multimodal large model in combination with the first PDF paper, and output the first metadata of the first PDF paper. The first PDF paper is input to the trained multimodal large model for metadata extraction; the first initial metadata is then verified against the extraction result, i.e., the metadata extracted from the PDF paper by the model is compared with the initial metadata for consistency, checking the accuracy and completeness of the initial metadata. After passing verification, the initial metadata is output as the first metadata corresponding to the first PDF paper.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (9)

1.基于多模态大模型的PDF论文元数据结构化解析方法,其特征在于,所述方法包括:1. A structured analysis method for PDF paper metadata based on a multimodal large model, characterized in that the method comprises: 采集第一PDF论文进行视觉分析,生成第一PDF图像信息,根据文本抽取需求对所述第一PDF图像信息进行标识,生成元数据抽取提示;Collecting the first PDF paper for visual analysis, generating first PDF image information, marking the first PDF image information according to text extraction requirements, and generating metadata extraction prompts; 构建初始多模态大模型,基于所述第一PDF图像信息与所述元数据抽取提示同步至所述初始多模态大模型,输出所述第一PDF论文的第一初始元数据;Constructing an initial multimodal large model, synchronizing the first PDF image information and the metadata extraction prompt to the initial multimodal large model, and outputting first initial metadata of the first PDF paper; 基于多个信息源采集N个PDF论文进行视觉分析,生成N个PDF图像信息,根据所述文本抽取需求定义多个数据文本生成元数据抽取提示集合,N为大于1的整数;Based on multiple information sources, N PDF papers are collected for visual analysis to generate N PDF image information, and multiple data texts are defined according to the text extraction requirements to generate metadata extraction prompt sets, where N is an integer greater than 1; 将所述N个PDF图像信息与所述元数据抽取提示集合同步至所述初始多模态大模型,输出所述N个PDF论文的N个初始元数据,所述N个PDF论文与所述N个初始元数据存在对应关系;Synchronizing the N PDF image information and the metadata extraction prompt set to the initial multimodal large model, and outputting N initial metadata of the N PDF papers, wherein the N PDF papers correspond to the N initial metadata; 基于所述N个PDF论文与所述N个初始元数据对所述初始多模态大模型进行调整,生成多模态大模型;Adjusting the initial multimodal large model based on the N PDF papers and the N initial metadata to generate a multimodal large model; 利用所述多模态大模型结合所述第一PDF论文对所述第一初始元数据进行验证,输出所述第一PDF论文的第一元数据。The first initial metadata is verified by using the multimodal large model in combination with the first PDF paper, and the first metadata of the first PDF paper is output. 2.如权利要求1所述的基于多模态大模型的PDF论文元数据结构化解析方法,其特征在于,所述第一PDF论文,方法包括:2. 
The method for structured analysis of PDF paper metadata based on a multimodal large model according to claim 1, wherein the first PDF paper comprises: 遍历所述多个信息源进行随机选择,生成多个PDF文档;Traversing the multiple information sources and performing random selection to generate multiple PDF documents; 按照所述多个PDF文档进行虚拟生成,生成多个虚拟元数据、多个首页文本,所述多个虚拟元数据、所述多个首页文本与所述多个PDF文档存在对应关系;Performing virtual generation according to the plurality of PDF documents to generate a plurality of virtual metadata and a plurality of home page texts, wherein the plurality of virtual metadata and the plurality of home page texts correspond to the plurality of PDF documents; 将所述多个虚拟元数据、所述多个首页文本按照所述多个PDF文档进行对齐,遍历对齐结果进行随机提取,生成所述第一PDF论文。The multiple virtual metadata and the multiple homepage texts are aligned according to the multiple PDF documents, and the alignment results are traversed for random extraction to generate the first PDF paper. 3.如权利要求1所述的基于多模态大模型的PDF论文元数据结构化解析方法,其特征在于,采集第一PDF论文进行视觉分析,生成第一PDF图像信息,根据文本抽取需求对所述第一PDF图像信息进行标识,生成元数据抽取提示,方法包括:3. 
The method for structured analysis of PDF paper metadata based on a multimodal large model as claimed in claim 1 is characterized in that a first PDF paper is collected for visual analysis to generate first PDF image information, the first PDF image information is marked according to text extraction requirements, and a metadata extraction prompt is generated, the method comprising: 将所述第一PDF论文的文档首页进行页面转换,生成文档首页图像;Convert the first page of the first PDF paper to generate a document first page image; 对所述文档首页图像进行图像标准化处理,生成第一PDF图像信息;Performing image standardization processing on the document homepage image to generate first PDF image information; 利用计算机视觉技术对所述第一PDF图像信息进行特征提取,确定多个图像特征,并获取文本抽取需求;Using computer vision technology to extract features from the first PDF image information, determine multiple image features, and obtain text extraction requirements; 按照所述文本抽取需求遍历所述第一PDF图像信息进行标注,生成待抽取标签;Traversing the first PDF image information to perform annotation according to the text extraction requirement, and generating labels to be extracted; 根据所述待抽取标签确定所述元数据抽取提示。The metadata extraction hint is determined according to the tag to be extracted. 4.如权利要求3所述的基于多模态大模型的PDF论文元数据结构化解析方法,其特征在于,构建所述初始多模态大模型,方法包括:4. 
The PDF paper metadata structured parsing method based on a multimodal large model as claimed in claim 3, characterized in that the initial multimodal large model is constructed, and the method comprises: 采用卷积神经网络构建视觉编码器架构,利用所述视觉编码器架构结合所述第一PDF图像信息进行训练,确定视觉编码器;Using a convolutional neural network to construct a visual encoder architecture, using the visual encoder architecture combined with the first PDF image information for training, and determining a visual encoder; 基于所述元数据抽取提示进行预处理,生成词嵌入序列;Perform preprocessing based on the metadata extraction prompts to generate word embedding sequences; 采用循环神经网络构建语言编码器架构,利用所述语言编码器架构结合所述词嵌入序列进行训练,确定语言编码器;A language encoder architecture is constructed using a recurrent neural network, and the language encoder architecture is used in combination with the word embedding sequence for training to determine a language encoder; 将所述视觉编码器与所述语言编码器进行协同连接,构建模型组件;The visual encoder and the language encoder are collaboratively connected to construct a model component; 调用LLaVA架构处理,结合所述模型组件构建所述初始多模态大模型,其中,LLaVA是一个开源的多模态大模型,旨在将视觉编码器和大型语言模型连接起来,以实现通用的视觉和语言理解,能够处理PDF文档的视觉信息,生成元数据文本。Call the LLaVA architecture for processing, and combine the model components to build the initial multimodal large model, wherein LLaVA is an open source multimodal large model that aims to connect visual encoders and large language models to achieve general vision and language understanding, and can process the visual information of PDF documents and generate metadata text. 5.如权利要求4所述的基于多模态大模型的PDF论文元数据结构化解析方法,其特征在于,基于所述第一PDF图像信息与所述元数据抽取提示同步至所述初始多模态大模型,输出所述第一PDF论文的第一初始元数据,方法包括:5. 
5. The method for structured analysis of PDF paper metadata based on a multimodal large model as claimed in claim 4, characterized in that the first PDF image information and the metadata extraction prompt are synchronized to the initial multimodal large model and the first initial metadata of the first PDF paper is output by a method comprising:
inputting the first PDF image information into the visual encoder for feature extraction to generate a visual feature vector;
inputting the metadata extraction prompt into the language encoder for feature extraction to generate a text feature vector;
jointly encoding the visual feature vector and the text feature vector to generate a fusion feature; and
performing decoding based on the fusion feature to generate the first initial metadata.
6. The method for structured analysis of PDF paper metadata based on a multimodal large model as claimed in claim 5, characterized in that before the visual feature vector and the text feature vector are jointly encoded, the method comprises:
aligning the visual feature vector with the text feature vector to generate an image-text pair;
introducing a contrastive loss function to compute over the image-text pair and determine vector distance data;
determining whether the vector distance data falls within a vector distance threshold interval, and if so, generating a matching signal and comparing the visual feature vector with the text feature vector according to the matching signal; and
enhancing the matching signal when the visual feature vector is consistent with the text feature vector, and reducing the matching signal when the visual feature vector is inconsistent with the text feature vector.
7. The method for structured analysis of PDF paper metadata based on a multimodal large model as claimed in claim 6, characterized in that the visual feature vector and the text feature vector are jointly encoded to generate the fusion feature by a method comprising:
dividing a plurality of visual feature sub-regions based on the visual feature vector, the plurality of visual feature sub-regions containing visual tokens;
performing attention computation based on the visual tokens to generate a plurality of visual attention weights;
dividing a plurality of text feature sub-regions based on the text feature vector, the plurality of text feature sub-regions containing text tokens;
performing attention computation based on the text tokens to generate a plurality of text attention weights;
constructing, based on the matching signal, a plurality of association relationships between the visual feature vector and the text feature vector according to the plurality of visual attention weights and the plurality of text attention weights; and
jointly encoding the visual feature vector and the text feature vector based on the plurality of association relationships to generate the fusion feature.
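The alignment check described in claim 6 (a vector distance tested against a threshold interval, then a matching signal that is strengthened or weakened) can be sketched with cosine distance. The interval bounds and the update factor below are illustrative assumptions; the patent does not fix the distance measure or the signal update rule.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def update_matching_signal(visual_vec, text_vec, signal=1.0,
                           interval=(0.0, 0.5), factor=1.1):
    """If the image-text distance falls inside the threshold interval,
    treat the pair as matched and strengthen the signal; otherwise
    weaken it. Interval and factor are hypothetical choices."""
    d = cosine_distance(visual_vec, text_vec)
    lo, hi = interval
    if lo <= d <= hi:
        return signal * factor   # consistent pair: enhance the signal
    return signal / factor       # inconsistent pair: reduce the signal

aligned = update_matching_signal([1.0, 0.0], [1.0, 0.1])
mismatched = update_matching_signal([1.0, 0.0], [0.0, 1.0])
```

A training loop would apply this per image-text pair, with the contrastive loss pulling matched pairs together and pushing mismatched pairs apart.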
8. The method for structured analysis of PDF paper metadata based on a multimodal large model as claimed in claim 1, characterized in that the initial multimodal large model is adjusted based on the N PDF papers and the N pieces of initial metadata to generate the multimodal large model by a method comprising:
performing data augmentation on the N PDF papers and the N pieces of initial metadata to construct a training dataset and a validation dataset;
training the metadata extraction mode using the training dataset to generate a plurality of training results;
cross-validating the plurality of training results using the validation dataset;
extracting the training results that pass validation to dynamically adjust the initial multimodal large model and generate an adjustment result; and
introducing an evaluation index to evaluate the adjustment result, and generating the multimodal large model according to the adjustment evaluation information.
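A minimal sketch of the dataset construction and cross-validation described in claim 8. The fold count is a placeholder, and each integer stands in for a (PDF paper, initial metadata) pair; the claim does not fix either.

```python
def k_fold_splits(items, k=5):
    """Yield (train, validation) partitions for k-fold cross-validation.
    Each item appears in the validation set of exactly one fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, val

papers = list(range(10))  # stand-ins for (PDF, metadata) pairs
splits = list(k_fold_splits(papers, k=5))
```

Each `(train, val)` pair would drive one round of extraction training followed by validation of the resulting training results.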
9. The method for structured analysis of PDF paper metadata based on a multimodal large model as claimed in claim 8, characterized in that an evaluation index is introduced to evaluate the adjustment result and the multimodal large model is generated according to the adjustment evaluation information by a method comprising:
setting a metadata extraction target based on the N PDF papers and the N pieces of initial metadata;
performing extraction training based on the metadata extraction target, testing according to the training results, and calculating a test pass rate;
setting an extraction quality score according to the test pass rate, and formulating the evaluation index according to the quality score value;
determining, according to the evaluation index, whether the adjustment result contains adjustment data smaller than the evaluation index;
if the adjustment result contains adjustment data smaller than the evaluation index, annotating and recording the adjustment data smaller than the evaluation index, and adding it to a to-be-verified data group; and
screening the adjustment result based on the to-be-verified data group to generate the adjustment evaluation information.
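The evaluation step of claim 9 (compute a test pass rate, derive a quality threshold from it, and route sub-threshold adjustment data to a to-be-verified group) can be sketched as follows. The pass criterion and the mapping from pass rate to threshold are illustrative assumptions; the claim leaves both open.

```python
def evaluate_adjustments(scores, passed, pass_weight=1.0):
    """Compute the test pass rate, derive a quality threshold from it
    (hypothetical mapping: rate * weight), and flag adjustment scores
    below the threshold for re-verification."""
    pass_rate = sum(passed) / len(passed)
    threshold = pass_rate * pass_weight
    to_verify = [s for s in scores if s < threshold]   # annotate & record
    accepted = [s for s in scores if s >= threshold]
    return threshold, accepted, to_verify

threshold, accepted, to_verify = evaluate_adjustments(
    scores=[0.9, 0.6, 0.85, 0.4],
    passed=[True, True, True, False],  # 3 of 4 tests passed
)
```

The `to_verify` list plays the role of the to-be-verified data group used to screen the adjustment result.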
CN202411205429.4A 2024-08-30 2024-08-30 A structured analysis method for PDF paper metadata based on a multimodal large model Active CN119005168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411205429.4A CN119005168B (en) 2024-08-30 2024-08-30 A structured analysis method for PDF paper metadata based on a multimodal large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411205429.4A CN119005168B (en) 2024-08-30 2024-08-30 A structured analysis method for PDF paper metadata based on a multimodal large model

Publications (2)

Publication Number Publication Date
CN119005168A CN119005168A (en) 2024-11-22
CN119005168B true CN119005168B (en) 2025-03-25

Family

ID=93487132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411205429.4A Active CN119005168B (en) 2024-08-30 2024-08-30 A structured analysis method for PDF paper metadata based on a multimodal large model

Country Status (1)

Country Link
CN (1) CN119005168B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120996032A (en) * 2025-10-22 2025-11-21 之江实验室 A BERT-based academic paper title classification device and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275021A (en) * 2023-09-07 2023-12-22 山东浪潮科学研究院有限公司 Key information extraction method and system based on multi-modal model
CN118076982A (en) * 2021-11-26 2024-05-24 巴西石油公司 Information extraction and structuring method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004066125A2 (en) * 2003-01-14 2004-08-05 V-Enable, Inc. Multi-modal information retrieval system
CN118093689A (en) * 2024-01-11 2024-05-28 珠海金智维信息科技有限公司 RPA-based multimodal document parsing and structured processing system
CN118395205B (en) * 2024-05-14 2025-05-30 北京邮电大学 Multi-mode cross-language detection method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118076982A (en) * 2021-11-26 2024-05-24 巴西石油公司 Information extraction and structuring method
CN117275021A (en) * 2023-09-07 2023-12-22 山东浪潮科学研究院有限公司 Key information extraction method and system based on multi-modal model

Also Published As

Publication number Publication date
CN119005168A (en) 2024-11-22

Similar Documents

Publication Publication Date Title
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN116151256A (en) A Few-Shot Named Entity Recognition Method Based on Multi-task and Hint Learning
CN116843175A (en) A contract clause risk inspection method, system, equipment and storage medium
CN116450834A (en) Archive knowledge graph construction method based on multi-mode semantic features
CN117689963B (en) Visual entity linking method based on multi-mode pre-training model
CN118886427B (en) A prompt word optimization method combining expert evaluation rules and large language model
CN118965192A (en) A generative AI service website identification method based on multimodal fusion learning
CN115329755A (en) Entity link model processing method and device and entity link processing method and device
CN118799690A (en) Marine remote sensing visual question answering method and system based on multi-order knowledge comparison
CN114416925B (en) Sensitive word identification method, device, equipment, storage medium and program product
CN118551044B (en) Cross-prompt automatic composition scoring method and device based on category countermeasure joint learning and electronic equipment
CN119005168B (en) A structured analysis method for PDF paper metadata based on a multimodal large model
CN117873487B (en) A method for generating code function annotations based on GVG
CN119272106A (en) An adaptive multimodal false news detection method and model based on dual features
CN117252205A (en) Semantically enhanced Chinese entity relationship extraction method and system based on relationship constraints
CN120832418A (en) Automatic extraction method of deep semantic entities and relations for industry standard documents
CN119003465B (en) Method of structuring PDF files based on OCR and large models
CN115221299A (en) Table question-answer data processing and model training method, electronic device and storage medium
CN119578519B (en) A few-shot Chinese network security incident detection method based on meta-learning
CN119514523A (en) A document content review method and system based on combined artificial intelligence
CN118095261A (en) A text data processing method, device, equipment and readable storage medium
CN119670729B (en) A method for out-of-context false information detection based on global information enhancement
CN119007230B (en) Electronic file case element cognition method and system based on Transformer framework
CN116227486B (en) Emotion analysis method based on retrieval and contrast learning
CN119540972B (en) A method and system for automatic examination paper review based on generative artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant