CN119128192B - Content retrieval method combining PDF search mode with OCR recognition - Google Patents
Content retrieval method combining PDF search mode with OCR recognition
- Publication number
- CN119128192B (application CN202411628999.4A)
- Authority
- CN
- China
- Prior art keywords
- content
- recognition
- formats
- model
- retrieval
- Prior art date: 2024-11-15
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
  - G06—COMPUTING OR CALCULATING; COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/50—Information retrieval of still image data
          - G06F16/53—Querying
            - G06F16/532—Query formulation, e.g. graphical querying
            - G06F16/538—Presentation of query results
          - G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
            - G06F16/583—Retrieval using metadata automatically derived from the content
              - G06F16/5846—Retrieval using metadata automatically derived from the content, using extracted text
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
        - G06V30/40—Document-oriented image-based pattern recognition
          - G06V30/41—Analysis of document content
            - G06V30/413—Classification of content, e.g. text, photographs or tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Library & Information Science (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Character Discrimination (AREA)
Abstract
The application provides a content retrieval method combining a PDF (portable document format) search mode with OCR (optical character recognition), relating to the technical field of information retrieval. The method comprises: identifying the content formats of a current imaged PDF file to obtain a plurality of content formats; performing OCR recognition heterogeneity analysis among the plurality of content formats to obtain a plurality of corresponding heterogeneity indexes; classifying according to the plurality of heterogeneity indexes and outputting N types of content formats; constructing N content recognition models according to the text features of the content formats; acquiring user retrieval keywords, calling the N content recognition models to perform keyword retrieval in the file, outputting a plurality of corresponding content retrieval return results, and positioning and displaying them in the file. The application solves the technical problem that existing retrieval methods cannot effectively identify the relations and differences between formats, so that recognition errors and omissions occur easily, and achieves the technical effect of improving the efficiency and accuracy of imaged PDF content retrieval.
Description
Technical Field
The application relates to the technical field of information retrieval, and in particular to a content retrieval method combining a PDF search mode with OCR (optical character recognition).
Background
With the acceleration of informatization, PDF files have become the mainstream choice for storing and sharing documents thanks to their good portability and cross-platform compatibility. However, as PDF files see ever wider use, efficiently and accurately retrieving the desired information from them has become an increasingly prominent challenge.
Existing PDF content retrieval methods typically rely on extracting text information and recognizing static text with OCR (optical character recognition) techniques, a process that usually works only on standard text formats. When facing imaged PDFs, especially documents in non-standard text formats containing various annotations, multiple fonts, and multiple text colors, these methods cannot effectively identify the relations and differences between formats, significantly compromising information retrieval accuracy. Moreover, existing retrieval methods usually apply a single OCR model to all content: the retrieval process lacks pertinence, the differences between content formats are ignored, the ability to classify and recognize complex content is limited, and the risks of false detection and omission increase. As a result, users need more time and effort to find specific information, which lowers retrieval efficiency and degrades the information acquisition experience.
Disclosure of Invention
The application provides a content retrieval method combining a PDF search mode with OCR recognition. It solves the technical problem that existing PDF content retrieval methods, when processing complex documents containing multiple formats, cannot effectively identify the relations and differences between formats, so that recognition errors and omissions occur easily and the efficiency and accuracy of information acquisition suffer; and it achieves the technical effect of improving the efficiency and accuracy of imaged PDF content retrieval.
In view of the above problems, the application provides a content retrieval method combining a PDF search mode with OCR (optical character recognition), which comprises: identifying the content formats of a current imaged PDF file to obtain a plurality of content formats; performing OCR recognition heterogeneity analysis among the plurality of content formats to obtain a plurality of heterogeneity indexes corresponding to the plurality of content formats; classifying according to the plurality of heterogeneity indexes and outputting N types of content formats; constructing N content recognition models according to the text features of the N types of content formats; acquiring user retrieval keywords, performing keyword retrieval in the imaged PDF file by calling the N content recognition models, and outputting a plurality of content retrieval return results corresponding to the N content recognition models; and positioning and displaying in the imaged PDF file according to the plurality of content retrieval return results.
One or more technical solutions provided by the application have at least the following technical effects or advantages:
By identifying the content formats of the current imaged PDF file, a plurality of content formats are obtained; this improves the understanding of the document structure, allows subsequent steps to operate on accurate format information, and lays the foundation for content retrieval. Through OCR recognition heterogeneity analysis among the plurality of content formats, a plurality of heterogeneity indexes corresponding to the plurality of content formats are obtained, so the characteristics of different contents are recognized more accurately and the effectiveness of subsequent classification and retrieval is improved. Classifying according to the plurality of heterogeneity indexes and outputting N types of content formats systematically organizes the different types of content, strengthens the pertinence of the retrieval model, and reduces information loss and misunderstanding caused by format differences. Based on the classification result, a dedicated recognition model is built for each type of content format according to the text features of the N types of content formats; the resulting N content recognition models greatly improve recognition accuracy when processing specific formats and raise the accuracy and efficiency of overall retrieval. After user retrieval keywords are acquired, keyword retrieval is performed in the imaged PDF file by calling the N content recognition models; different models retrieve different content features, maximizing coverage, and a plurality of content retrieval return results corresponding to the N content recognition models are output, improving retrieval accuracy and comprehensiveness. Finally, positioning and display are performed in the imaged PDF file according to the plurality of content retrieval return results, so users can quickly find the required information and information acquisition efficiency is improved.
In summary, by recognizing and analyzing the content formats of the imaged PDF file, performing heterogeneity analysis in combination with OCR technology, and systematically classifying and building targeted content recognition models, the application remarkably improves the accuracy and efficiency of PDF content retrieval. User keyword retrieval is carried out by flexibly calling different models, and the results are accurately positioned in the document; this enhances adaptability to diversified content, improves the comprehensiveness and accuracy of information acquisition overall, helps users quickly obtain the information they need, and markedly optimizes the information retrieval experience.
The foregoing is only an overview of the technical solution of the present application. To make the technical means of the application clearer, so that it can be implemented in accordance with the contents of the description, and to make the above and other objects, features, and advantages of the application more readily apparent, specific embodiments of the application are set forth below.
Drawings
Fig. 1 is a flowchart of a content retrieval method combining PDF search mode with OCR recognition according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of acquiring a plurality of content formats in the content retrieval method combining PDF search mode with OCR recognition according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of constructing N content recognition models in the content retrieval method combining PDF search mode with OCR recognition according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a content retrieval method combining a PDF search mode with OCR (optical character recognition). By recognizing the content formats of an imaged PDF file, a corresponding content recognition model is constructed for each content format; the plurality of content recognition models are called to perform keyword retrieval in the imaged PDF file respectively, and a plurality of corresponding content retrieval results are output. This solves the technical problem that, when processing complex documents containing multiple formats, the relations and differences between formats cannot be effectively identified, so recognition errors and omissions occur easily and the efficiency and accuracy of information acquisition suffer; and it achieves the technical effect of improving the efficiency and accuracy of imaged PDF content retrieval.
As shown in Fig. 1, an embodiment of the present application provides a content retrieval method combining PDF search mode with OCR recognition, where the method includes:
and step S1, identifying the content formats of the current imaging PDF file to obtain a plurality of content formats.
Specifically, an imaged PDF file is one whose content is stored in the form of images; it may be a scanned document or a PDF containing complex formatting. Content formats refer to the different types of text fonts, text colors, annotations, and the like present in the document.
The input imaged PDF file is first analyzed, and the various content formats in the file are identified using OCR technology, for example fonts of different colors, different font styles (e.g., bold, italic), and the layouts and styles of annotations. The identified content formats are recorded to provide basic data for subsequent processing.
Step S2: based on OCR recognition heterogeneity analysis among the plurality of content formats, acquiring a plurality of heterogeneity indexes corresponding to the plurality of content formats.
Specifically, OCR (optical character recognition) is a technology for converting characters in an image into an editable text format. The heterogeneity index is used to evaluate the feature differences and similarities between different content formats.
OCR technology is applied to each of the content formats identified in step S1 to extract the various text contents, and heterogeneity analysis is performed by comparing features of the text (e.g., font style, text size, text color). For example, an OCR tool such as Tesseract can be used to extract the text while recording the accuracy, speed, and specific patterns of recognition errors; the recognition performance of each format is quantified with statistical measures such as the mean and standard deviation, and a heterogeneity index is calculated for each format, which helps determine the similarity and difference between content formats.
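As an illustration of this measurement step, the following minimal sketch collects per-format recognition statistics with pytesseract; the measure_format helper and the use of word-level confidences as an accuracy proxy are assumptions for illustration, not details fixed by the method.

```python
import statistics
import time

import pytesseract
from PIL import Image

def measure_format(image_paths):
    """Run OCR over the page crops of one content format and collect
    the indicators used for the heterogeneity analysis (assumed proxy)."""
    confidences, durations = [], []
    for path in image_paths:
        img = Image.open(path)
        start = time.perf_counter()
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        durations.append(time.perf_counter() - start)
        # Word-level confidences stand in for recognition accuracy;
        # Tesseract reports -1 for non-word boxes, which are skipped.
        confidences.extend(float(c) for c in data["conf"] if float(c) >= 0)
    return {
        "mean_confidence": statistics.mean(confidences),
        "confidence_stdev": statistics.stdev(confidences),
        "mean_speed_s": statistics.mean(durations),
    }
```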
Step S3: classifying according to the plurality of heterogeneity indexes, and outputting N types of content formats.
Specifically, using the heterogeneity indexes generated in the previous step, the content formats are classified with an unsupervised learning algorithm such as K-means clustering, DBSCAN, or hierarchical clustering; similar content formats are grouped into one class, and N types of content formats are output, such as "black font", "red font", and "annotation text", to facilitate subsequent processing.
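A minimal sketch of this clustering step with scikit-learn follows; the feature values are invented placeholders, and K-means is just one of the algorithms named above.

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per content format: [accuracy, speed, ambiguity, char spacing]
# (illustrative values only, not measurements from the method).
feature_vectors = np.array([
    [0.97, 0.12, 0.05, 1.8],   # black body text
    [0.91, 0.15, 0.12, 1.9],   # red emphasis text
    [0.78, 0.30, 0.35, 2.6],   # annotation text
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(feature_vectors)   # format class per row
```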
Step S4: constructing N content recognition models according to the text features of the N types of content formats.
In particular, text features are attributes describing the text, such as font, layout, and sharpness, used in constructing a content recognition model. A content recognition model is an algorithmic model established for a specific content format, aimed at improving recognition accuracy for content in that format.
According to the characteristics of each type of content format, N dedicated content recognition models are constructed with a machine learning algorithm; each model focuses on recognizing one specific format so as to improve the accuracy of subsequent retrieval. For example, a deep learning framework (e.g., TensorFlow or PyTorch) can be used to train a convolutional neural network (CNN) for each format, training and optimizing per format to generate the N content recognition models.
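The sketch below shows what one such per-format backbone might look like in PyTorch; the layer sizes, input resolution, and class count are illustrative assumptions rather than values fixed by the method.

```python
import torch
import torch.nn as nn

class FormatCNN(nn.Module):
    """One recognition backbone, trained separately per content format."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 32, 32) grayscale character/word patches
        return self.classifier(torch.flatten(self.features(x), 1))
```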
Step S5: acquiring user retrieval keywords, performing keyword retrieval in the imaged PDF file by calling the N content recognition models, and outputting a plurality of content retrieval return results corresponding to the N content recognition models.
Specifically, a user retrieval keyword is a text string input by the user to search for specific information; the keyword may be a word, a phrase, or a sentence. A content retrieval return result is a text segment or region found in the PDF file that matches the user's keyword. These results typically include the context in which the keyword appears and its location information in the PDF file.
The retrieval keyword input by the user is acquired, and the N content recognition models are called to search respectively. Each model performs keyword matching within its specific content format and outputs the retrieval results related to the user keyword. For example, if the user inputs the keyword "annual report", the body-text model retrieves the matching text content while the annotation-text model retrieves the matching annotation content.
Step S6: positioning and displaying in the imaged PDF file according to the plurality of content retrieval return results.
Specifically, the plurality of content retrieval return results output in step S5 are acquired, and the corresponding areas of the original PDF file are marked according to the position information of each returned result. These marked results can be displayed to the user in a visually evident manner (e.g., highlighting, underlining, or borders) using the graphics rendering capability of a PDF reader or editor. For example, if a keyword occurs at a specific page and position in the PDF file, the corresponding text or content block is highlighted so that the user can quickly locate the required information, improving the efficiency and convenience of searching large PDF files.
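A sketch of this positioning step with PyMuPDF is given below; the result structure (a page index plus a bounding box) is a hypothetical interface assumed for illustration.

```python
import fitz  # PyMuPDF

def highlight_results(pdf_path, results, out_path):
    """Mark each retrieval hit with a highlight annotation."""
    doc = fitz.open(pdf_path)
    for res in results:
        page = doc[res["page"]]          # 0-based page index
        rect = fitz.Rect(res["bbox"])    # (x0, y0, x1, y1) in page coordinates
        page.add_highlight_annot(rect)   # visually marks the hit
    doc.save(out_path)

# Hypothetical usage: highlight one hit on the first page.
highlight_results("input.pdf",
                  [{"page": 0, "bbox": (72, 100, 260, 118)}],
                  "highlighted.pdf")
```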
Further, as shown in Fig. 2, step S1 of the embodiment of the present application further includes:
The method comprises: extracting image layers from the imaged PDF file to obtain a plurality of image layers; inputting the plurality of image layers into an image optimization module, which comprises sharpness standardization, image denoising, and contrast synchronization; outputting a plurality of optimized image layers according to the image optimization module; and recognizing the plurality of optimized image layers to obtain the plurality of content formats.
Specifically, a plurality of image layers are first extracted from the imaged PDF file. PDF content can be converted into separate image files with PDF processing tools (e.g., PyMuPDF or PDF.js). Each image layer contains different information, such as black fonts, red fonts, or text annotations.
Next, the extracted image layers are input into the image optimization module, a set of processing tools for improving image quality that comprises sharpness standardization, image denoising, contrast synchronization, and similar techniques. Within the module, sharpness standardization is performed with an image enhancement algorithm from an image processing library (e.g., OpenCV), adjusting the sharpness of each image layer so that all layers remain consistent in resolution and detail; noise is removed with a filter (e.g., Gaussian or median filtering) to improve image quality; and contrast synchronization ensures that the contrast of all image layers is relatively consistent, so the content is easier to identify in subsequent processing. After processing by the image optimization module, a plurality of optimized image layers are output; their quality is markedly improved, making them suitable for more accurate content format recognition.
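A minimal OpenCV sketch of such an optimization pass follows, assuming grayscale layer images; resizing to a common width stands in for sharpness standardization, and histogram equalization for contrast synchronization.

```python
import cv2

def optimize_layer(path: str, target_width: int = 1600):
    """Normalize one extracted image layer before OCR."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Sharpness/resolution standardization: bring every layer to one width.
    scale = target_width / img.shape[1]
    img = cv2.resize(img, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_CUBIC)
    # Denoising: a light Gaussian blur suppresses scanning noise.
    img = cv2.GaussianBlur(img, (3, 3), 0)
    # Contrast synchronization: equalize so all layers match in contrast.
    return cv2.equalizeHist(img)
```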
Finally, the optimized image layers are recognized, and the plurality of content formats are extracted with OCR technology and image recognition algorithms. Text extraction is performed, for example, with Tesseract, while a deep learning model classifies the images, ensuring that the different types of content are accurately recognized.
These steps improve image quality through layer extraction and optimization, so that content in different formats can be clearly recognized, greatly improving the recognition accuracy of PDF file content.
Further, after the image layer extraction is performed on the imaged PDF file, the method further includes:
After the image layers are extracted from the imaged PDF file, whether a residual digital text layer exists in the imaged PDF file is judged; if a residual digital text layer exists, the plurality of optimized image layers and the digital text layer are both recognized to obtain the plurality of content formats.
Specifically, after image layer extraction is performed on the imaged PDF file, it is checked whether any digital text layers remain. A digital text layer is editable text content contained directly in the PDF file, as opposed to text that exists only in image form. The text information in the PDF can be read, and its presence determined, with a PDF processing library (e.g., PyMuPDF or PDFMiner). If a digital text layer is found in the file, this information is recorded for subsequent processing.
The plurality of optimized image layers and the digital text layer are recognized at the same time: text is extracted from the optimized image layers with OCR technology, while the content of the digital text layer is read directly. For the image layers, an OCR tool such as Tesseract can be used; for the digital text layer, the text extraction function of the PDF processing library can be called directly to acquire the content.
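The check itself can be as simple as the PyMuPDF sketch below, which routes pages with a residual text layer away from the OCR path; the function name and return shape are assumed for illustration.

```python
import fitz  # PyMuPDF

def split_by_text_layer(pdf_path: str):
    """Separate pages with a residual digital text layer from image-only pages."""
    doc = fitz.open(pdf_path)
    text_pages, image_only_pages = [], []
    for page in doc:
        if page.get_text().strip():      # editable text still present
            text_pages.append(page.number)
        else:                            # image-only page, send to OCR path
            image_only_pages.append(page.number)
    return text_pages, image_only_pages
```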
By recognizing the optimized image layers and the digital text layer simultaneously, information from both sources can be integrated and the plurality of content formats finally obtained; this ensures comprehensive recognition, extracts as much of the information in the document as possible, and forms a complete content structure.
Illustratively, consider an imaged PDF file containing a scanned document (as an image layer) and some directly generated editable text (as a digital text layer). The image layers are first extracted and optimized, and the file is then checked for editable digital text. If present, the optimized image content (e.g., scanned characters) and the direct text in the PDF (e.g., titles and paragraphs) are extracted simultaneously, ensuring that all available information is recognized.
Further, step S2 of the embodiment of the present application further includes:
The method comprises: acquiring a feature index corresponding to each of the plurality of content formats, wherein the feature index comprises recognition accuracy, recognition speed, recognition ambiguity, and character spacing; analyzing according to the feature index of each content format, and outputting a plurality of OCR recognition feature vectors corresponding to the plurality of content formats; and quantifying the similarity among the plurality of OCR recognition feature vectors, and outputting a plurality of heterogeneity indexes corresponding to the plurality of content formats.
Specifically, a feature index is a quantitative indicator for describing and measuring the recognition effect on a content format. In OCR recognition, the feature indexes include recognition accuracy (the degree to which the recognition result matches the actual content), recognition speed (the time required by the recognition process), recognition ambiguity (the uncertainty or blur of the recognition result), character spacing (the spatial distance between characters in the text), and the like.
The identified content formats are analyzed with an OCR tool to extract the feature index corresponding to each content format, including recognition accuracy, recognition speed, recognition ambiguity, and character spacing. For example, recognition accuracy can be determined by comparing the recognition result with manually annotated content; the average processing time per content format is recorded as the recognition speed; recognition ambiguity is determined by evaluating the sharpness of the recognized text with image processing techniques such as edge detection; and the spaces between characters in the recognized text are analyzed to determine character spacing.
The feature indexes of each content format are combined into a multi-dimensional vector representing the OCR recognition characteristics of that format, generating a plurality of OCR recognition feature vectors. Similarity among the feature vectors is then quantified: cosine similarity or Euclidean distance is used to compute the distance between vectors, the similarity and difference of the content formats are judged, and a heterogeneity index is output for each content format. If the feature vectors of two content formats are close, their recognition behavior is similar; otherwise, the difference between them is larger.
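One plausible way to turn the pairwise similarities into per-format heterogeneity indexes is sketched below with NumPy, taking each index as one minus the mean cosine similarity to the other formats; this particular aggregation is an assumption, not a formula given by the method.

```python
import numpy as np

def heterogeneity_indexes(feature_vectors):
    """Higher index = a format behaves less like the others under OCR."""
    v = np.asarray(feature_vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = unit @ unit.T                 # pairwise cosine similarity matrix
    np.fill_diagonal(sim, np.nan)       # ignore self-similarity
    return 1.0 - np.nanmean(sim, axis=1)
```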
In these steps, the feature indexes of each content format are analyzed and converted into feature vectors, and similarity quantification is used to understand the differences between content formats in depth, providing basic data for the subsequent classification and recognition models.
Further, as shown in Fig. 3, step S4 of the embodiment of the present application further includes:
The method comprises: collecting N training data sets according to the text features of the N types of content formats; initializing convolutional neural networks to train the N training data sets independently, and outputting N initial content recognition models; performing integrated fusion learning on the N initial content recognition models to generate an integrated content recognition model; and optimizing the learning parameters of the N initial content recognition models according to the learning parameters of the integrated content recognition model, thereby constructing the N content recognition models.
Specifically, corresponding training data sets are collected based on the text features of the N types of content formats. Each type of content format requires sufficient example data for the model to learn effectively; such data may come from existing labeled data sets or be generated by manual labeling and data synthesis. As an example, suppose the content formats have been divided into three classes: black fonts, red fonts, and text with annotations. For each class, a corresponding training data set is collected. The black-font training data set contains samples of multiple black fonts, such as document content in different font types and newspaper articles; each sample is labeled "black font". The red-font training data set contains red-font text in different styles, including markers for important information and warning cues; each sample is labeled "red font". The annotated-text data set contains text samples carrying annotations, which may be boxes, underlines, or notes in different colors; each sample is labeled "annotated text".
A convolutional neural network is initialized for each content format using a deep learning framework such as TensorFlow or PyTorch, and each initialized network is trained individually on its corresponding training data set. Through forward and backward propagation, each model learns the characteristics of its content class and continually optimizes its parameters. After training, N initial content recognition models are output, each focusing on recognizing its corresponding content format.
Illustratively, different convolutional neural network models are initialized for the three content-format classes. First, a convolutional neural network is initialized and recorded as the black-font model; it is optimized for black-font features, with suitable convolutional, pooling, and fully connected layers and an appropriate learning rate and loss function. Another network, recorded as the red-font model, focuses on recognizing red fonts; its parameters are adjusted for the different colors and sharpness of red text. A third network is initialized and trained on annotated text, recorded as the annotation-text model; it focuses on the spatial relationships among annotation boxes, colors, and text.
Using the black-font training data set, the black-font model is trained; through the backpropagation algorithm it learns how to identify the features of black text, and its parameters are continually adjusted during training to improve recognition accuracy. The same procedure is applied with the red-font training data set to train the red-font model, which focuses on the features of red text, including its appearance and sharpness in the image. The annotated-text training data set is used to train the annotation-text model, which learns the various annotation styles and relative positions so that annotation content can be recognized accurately.
After training, three initial content recognition models are available: the black-font recognition model efficiently recognizes black text, covering black content in different fonts and sizes; the red-font recognition model recognizes red fonts and is suited to extracting warnings or important marks; and the annotation-text recognition model is optimized for annotated content and can recognize various annotation styles.
Integrated fusion learning is then performed on the N initial content recognition models: the prediction results of the N initial models are combined, and a stronger integrated content recognition model is trained by an ensemble learning method. The fusion can adopt strategies such as weighted voting or averaging, with comprehensive weighting according to the performance contribution of each model. Based on the learning parameters of the integrated content recognition model, the learning parameters of the N initial models are optimized, typically with an algorithm such as gradient descent, so that each model performs better on new data, thereby constructing the refined N content recognition models.
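The weighted-average variant of the fusion strategies named above might look like the following sketch; the scikit-learn-style predict_proba() interface and the use of validation accuracies as weights are assumptions for illustration.

```python
import numpy as np

def ensemble_predict(models, val_accuracies, batch):
    """Weighted average of the per-model class-probability outputs."""
    weights = np.asarray(val_accuracies, dtype=float)
    weights /= weights.sum()            # normalize the performance weights
    probs = sum(w * m.predict_proba(batch)
                for w, m in zip(weights, models))
    return probs.argmax(axis=1)         # fused class decision per sample
```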
These steps construct efficient content recognition models by systematically collecting training data, initializing and training convolutional neural networks, and applying integrated fusion learning. The models are each optimized for a different content format, improving effectiveness and accuracy when recognizing complex documents.
Further, optimizing the learning parameters of the N initial content recognition models according to the learning parameters of the integrated content recognition model further includes:
The method comprises: testing according to the learning parameters of each initial content recognition model, and outputting N recognition errors; acquiring the mark-initial content recognition models whose recognition errors are larger than a preset recognition error; acquiring updated model parameters according to the learning parameters of the integrated content recognition model; and iteratively adjusting the learning parameters of each mark-initial content recognition model according to the updated model parameters until content recognition models whose recognition errors are less than or equal to the preset recognition error are obtained, thereby constructing the N content recognition models.
Specifically, the learning parameters are the parameters adjusted by the model during training (mainly weights and biases), which determine the model's predictive capability. A recognition error is the difference between a model's prediction and the actual label, typically expressed as an error rate or loss value. The preset recognition error is a model performance threshold set in advance. Mark-initial content recognition models are those initial models whose recognition errors exceed the preset recognition error and therefore require further optimization.
First, each initial content recognition model is tested: the model's predictions are compared with the real labels, its recognition error on a validation set is calculated, and N recognition errors are output. For example, if one model has an error rate of 5% in text recognition and another has 10%, their recognition errors are 5% and 10%, respectively.
The N recognition errors are compared with the preset recognition error, the initial content recognition models whose recognition errors exceed it are found and marked, and these are determined to be the mark-initial content recognition models. The learning parameters of the integrated content recognition model are extracted as the updated model parameters. A gradient descent algorithm is then adopted: the learning parameters of each mark-initial content recognition model are iteratively adjusted according to the updated model parameters, and recognition errors are gradually reduced by repeatedly computing the loss and adjusting the parameters. After each adjustment, the model is retested to check whether its recognition error is less than or equal to the preset recognition error. Iteration continues until it is, and the optimized N content recognition models are finally output; these models exhibit good recognition within the given error range.
For example, suppose that in the model optimization stage the recognition error of the text model exceeds the preset recognition error; better parameters, such as the learning rate and regularization coefficient, are then obtained from the integrated model and applied to its training. After multiple iterations, once the recognition error of the text model falls below the preset value, the final text model construction is complete.
Further, the learning parameters of the mark-initial content recognition model are iteratively adjusted according to the updated model parameters, with the following expression:

θ^(t+1) = θ^(t) − η ∇_θ L(x_i, y_i; θ^(t))

wherein θ^(t+1) denotes the learning parameters of the mark-initial content recognition model after the (t+1)-th iteration; θ^(t) denotes the learning parameters of the mark-initial content recognition model after the t-th iteration; η is the learning rate, used to control the speed of parameter updating; ∇_θ L denotes the gradient of the loss function L with respect to the parameters θ; x_i denotes an input sample and y_i its corresponding label; and the loss function L measures the recognition error between the model predicted value and the true value.
The learning parameters of the mark-initial content recognition model are iteratively adjusted by a gradient descent algorithm, which comprises first obtaining the initial learning parameters θ^(0) of each mark-initial content recognition model.
For a given training sample (x_i, y_i), a prediction is made with the current learning parameters θ^(t), and the loss between the model's predicted value and the corresponding true value (label) y_i is calculated. By means of automatic differentiation, the gradient ∇_θ L of the loss function with respect to the parameters is computed; the gradient points in the direction in which the loss function grows fastest. A learning rate η is set to determine the magnitude of each parameter update. The parameters are then updated with the formula above, i.e., moved in the direction opposite to the gradient, which is the direction in which the loss function decreases.
The training samples are predicted again with the updated parameters and the above steps are repeated, continually iterating the learning parameters of the mark-initial content recognition model; after each iteration, the model's performance is evaluated on a validation set, until the recognition error is less than or equal to the preset recognition error. The final parameters θ* are then determined, and the content recognition model is constructed with these learning parameters.
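In PyTorch, plain SGD realizes exactly this update rule; the sketch below iterates until the loss, standing in here for the recognition error, drops to the preset threshold. The model, loss function, and threshold value are assumed inputs.

```python
import torch

def refine(model, loss_fn, samples, labels,
           lr=0.01, preset_error=0.05, max_iters=10_000):
    """Iterate theta <- theta - lr * grad L until the error target is met."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # eta = lr
    for _ in range(max_iters):
        optimizer.zero_grad()
        loss = loss_fn(model(samples), labels)   # L(x_i, y_i; theta_t)
        loss.backward()                          # gradient of L w.r.t. theta
        optimizer.step()                         # theta_{t+1} = theta_t - lr*grad
        if loss.item() <= preset_error:          # preset recognition error
            break
    return model
```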
Further, in step S5 of the embodiment of the present application, after the plurality of content retrieval return results corresponding to the N content recognition models are output, the method further includes:
evaluating the content retrieval return results and outputting a reliability score for each retrieval result; and performing secondary OCR recognition on the content retrieval return results whose reliability scores are lower than a preset threshold, and updating those content retrieval return results.
Specifically, the reliability score is an index for evaluating the reliability and accuracy of a retrieval result, typically expressed as a percentage or a score. The preset threshold is the minimum acceptable reliability score; results scoring below it are considered unreliable.
After the plurality of content retrieval return results corresponding to the N content recognition models are output, these results must be evaluated to verify their credibility. A reliability score is generated for each result by analyzing its characteristic indicators (e.g., recognition accuracy, character spacing). For example, if a result is clear and highly relevant to the user's retrieval keyword, its reliability score may be high (e.g., 90%), while an ambiguous result may score low (e.g., 60%). Each result's reliability score is compared with the preset threshold; if a result scores below the threshold (e.g., set to 75%), it is marked as requiring further processing.
Secondary OCR recognition is performed on the marked low-score content retrieval return results, with additional optimization applied to the image to improve recognition accuracy. For example, if the initially recognized text contains ambiguous or misrecognized characters, OCR can be re-applied after adjusting the image contrast and sharpness to extract more accurate text. Once the secondary OCR recognition is finished, the secondary recognition result replaces and updates the original content retrieval return result, improving the reliability of the overall retrieval. This ensures that the information acquired by the user is more accurate while strengthening the handling of complex text.
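A sketch of this secondary pass with OpenCV and pytesseract follows; the result dictionary, the 75% threshold, and the specific enhancement steps are illustrative assumptions.

```python
import cv2
import pytesseract

CONF_THRESHOLD = 75  # preset reliability threshold (assumed, e.g. 75%)

def refine_result(result, page_gray):
    """Re-OCR a low-confidence hit on an enhanced crop of the page image."""
    if result["confidence"] >= CONF_THRESHOLD:
        return result                      # trusted as-is
    x0, y0, x1, y1 = result["bbox"]
    crop = page_gray[y0:y1, x0:x1]         # grayscale page image expected
    # Additional optimization: raise contrast, then binarize (Otsu).
    crop = cv2.convertScaleAbs(crop, alpha=1.5, beta=0)
    _, crop = cv2.threshold(crop, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    result["text"] = pytesseract.image_to_string(crop).strip()
    return result
```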
In summary, the content retrieval method combining PDF searching mode with OCR recognition provided by the embodiment of the application has the following technical effects:
By identifying the content formats of the current imaged PDF file, a plurality of content formats are obtained; this improves the understanding of the document structure, allows subsequent steps to operate on accurate format information, and lays the foundation for content retrieval. Through OCR recognition heterogeneity analysis among the plurality of content formats, a plurality of heterogeneity indexes corresponding to the plurality of content formats are obtained, so the characteristics of different contents are recognized more accurately and the effectiveness of subsequent classification and retrieval is improved. Classifying according to the plurality of heterogeneity indexes and outputting N types of content formats systematically organizes the different types of content, strengthens the pertinence of the retrieval model, and reduces information loss and misunderstanding caused by format differences. Based on the classification result, convolutional neural networks are trained on the different content formats to construct a plurality of initial content recognition models, and integrated fusion learning improves the overall performance of the models, yielding N content recognition models; this greatly improves recognition precision when processing specific formats and raises the accuracy and efficiency of overall retrieval. After user retrieval keywords are acquired, keyword retrieval is performed in the imaged PDF file by calling the N content recognition models; different models retrieve different content features, maximizing coverage, and a plurality of content retrieval return results corresponding to the N content recognition models are output, improving retrieval accuracy and comprehensiveness. Further, through evaluation of the recognition results and secondary OCR recognition, the accuracy and credibility of the content retrieval results are effectively improved. Finally, positioning and display are performed in the imaged PDF file according to the plurality of content retrieval return results, so users can quickly find the required information and information acquisition efficiency is improved.
In the embodiment of the application, by recognizing and analyzing the content formats of the imaged PDF file and performing heterogeneity analysis in combination with OCR technology, targeted content recognition models are systematically classified and constructed, remarkably improving the accuracy and efficiency of PDF content retrieval. User keyword retrieval is performed by flexibly calling different models, and the results are accurately positioned in the document; this enhances adaptability to diversified content, improves the comprehensiveness and accuracy of information acquisition overall, helps users obtain the required information quickly, and markedly optimizes the information retrieval experience.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
- 1. A content retrieval method combining PDF search mode with OCR recognition, the method comprising:
identifying the content formats of a current imaged PDF file to obtain a plurality of content formats;
performing OCR recognition heterogeneity analysis among the plurality of content formats to obtain a plurality of heterogeneity indexes corresponding to the plurality of content formats;
classifying according to the plurality of heterogeneity indexes, and outputting N types of content formats;
constructing N content recognition models according to text features of the N types of content formats;
acquiring user retrieval keywords, performing keyword retrieval in the imaged PDF file by calling the N content recognition models, and outputting a plurality of content retrieval return results corresponding to the N content recognition models; and
positioning and displaying in the imaged PDF file according to the plurality of content retrieval return results;
wherein performing OCR recognition heterogeneity analysis among the plurality of content formats to obtain the plurality of heterogeneity indexes further comprises:
acquiring a feature index corresponding to each content format in the plurality of content formats, wherein the feature index comprises recognition accuracy, recognition speed, recognition ambiguity, and character spacing;
analyzing according to the feature index corresponding to each content format, and outputting a plurality of OCR recognition feature vectors corresponding to the plurality of content formats; and
quantifying similarity among the plurality of OCR recognition feature vectors, and outputting the plurality of heterogeneity indexes corresponding to the plurality of content formats.
- 2. The method of claim 1, wherein identifying the content formats of the current imaged PDF file to obtain the plurality of content formats further comprises:
performing image layer extraction on the imaged PDF file to obtain a plurality of image layers;
inputting the plurality of image layers into an image optimization module, wherein the image optimization module comprises sharpness standardization, image denoising, and contrast synchronization;
outputting a plurality of optimized image layers according to the image optimization module; and
recognizing the plurality of optimized image layers to obtain the plurality of content formats.
- 3. The method of claim 2, wherein after performing image layer extraction on the imaged PDF file, the method further comprises:
judging whether a residual digital text layer exists in the imaged PDF file; and
if a residual digital text layer exists in the imaged PDF file, recognizing the plurality of optimized image layers and the digital text layer to obtain the plurality of content formats.
- 4. The method of claim 1, wherein constructing the N content recognition models according to the text features of the N types of content formats comprises:
collecting N training data sets according to the text features of the N types of content formats;
initializing convolutional neural networks to train the N training data sets independently, and outputting N initial content recognition models;
performing integrated fusion learning on the N initial content recognition models to generate an integrated content recognition model; and
optimizing learning parameters of the N initial content recognition models according to learning parameters of the integrated content recognition model, and constructing the N content recognition models.
- 5. The method of claim 4, wherein optimizing the learning parameters of the N initial content recognition models according to the learning parameters of the integrated content recognition model comprises:
testing according to the learning parameters of each initial content recognition model, and outputting N recognition errors;
acquiring mark-initial content recognition models whose recognition errors are larger than a preset recognition error;
acquiring updated model parameters according to the learning parameters of the integrated content recognition model; and
iteratively adjusting the learning parameters of the mark-initial content recognition models according to the updated model parameters until content recognition models with recognition errors less than or equal to the preset recognition error are obtained, thereby constructing the N content recognition models.
- 6. The method of claim 5, wherein the learning parameters of the mark-initial content recognition model are iteratively adjusted according to the updated model parameters by:
θ^(t+1) = θ^(t) − η ∇_θ L(x_i, y_i; θ^(t));
wherein θ^(t+1) denotes the learning parameters of the mark-initial content recognition model after the (t+1)-th iteration; θ^(t) denotes the learning parameters of the mark-initial content recognition model after the t-th iteration; η is the learning rate, used to control the speed of parameter updating; ∇_θ L denotes the gradient of the loss function L with respect to the parameters θ; x_i denotes an input sample and y_i its corresponding label; and the loss function L is used to measure the recognition error between the model predicted value and the true value.
- 7. The method of claim 1, wherein after outputting the plurality of content retrieval return results corresponding to the N content recognition models, the method further comprises:
evaluating the content retrieval return results, and outputting a reliability score for each retrieval result; and
performing secondary OCR recognition on the content retrieval return results whose reliability scores are lower than a preset threshold, and updating the content retrieval return results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---
CN202411628999.4A (CN119128192B) | 2024-11-15 | 2024-11-15 | Content retrieval method combining PDF search mode with OCR recognition
Publications (2)
Publication Number | Publication Date |
---|---
CN119128192A (en) | 2024-12-13
CN119128192B (en) | 2025-03-04
Family
ID=93752366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---
CN202411628999.4A (CN119128192B, Active) | Content retrieval method combining PDF search mode with OCR recognition | 2024-11-15 | 2024-11-15
Country Status (1)
Country | Link |
---|---
CN | CN119128192B (en)
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---
CN114241506A (en) * | 2021-12-15 | 2022-03-25 | 北京构力科技有限公司 | Method and device for identifying and extracting PDF (Portable document Format) construction drawing content |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---
JPH1011459A (en) * | 1996-06-25 | 1998-01-16 | N T T Data Tsushin Kk | Document registration system |
US8385589B2 (en) * | 2008-05-15 | 2013-02-26 | Berna Erol | Web-based content detection in images, extraction and recognition |
US20140002865A1 (en) * | 2010-11-25 | 2014-01-02 | Yu Tang | Trapping method and apparatus |
US10602019B2 (en) * | 2018-07-19 | 2020-03-24 | Xerox Corporation | Methods and systems for enhancing image quality for documents with highlighted content |
CN113641783B (en) * | 2020-04-27 | 2024-07-19 | 北京庖丁科技有限公司 | Content block retrieval method, device, equipment and medium based on key sentences |
US11630956B2 (en) * | 2020-10-20 | 2023-04-18 | Jade Global, Inc. | Extracting data from documents using multiple deep learning models |
CN114491214A (en) * | 2021-11-30 | 2022-05-13 | 中国航空工业集团公司西安飞行自动控制研究所 | A bookmark-based fast network access method for PDF files |
CN117475444A (en) * | 2022-07-15 | 2024-01-30 | 珠海金山办公软件有限公司 | Text image recognition model training method, text image recognition method and equipment |
CN116627912A (en) * | 2023-07-19 | 2023-08-22 | 中国电子科技集团公司第十研究所 | Integration and extraction method for multi-modal content of multi-type document |
CN117390214B (en) * | 2023-12-12 | 2024-02-27 | 北京云成金融信息服务有限公司 | A file retrieval method and system based on OCR technology |
CN118397642B (en) * | 2024-05-24 | 2025-08-05 | 优顶特技术有限公司 | OCR-based bill information recognition method, device, equipment and storage medium |
- 2024-11-15: Application CN202411628999.4A filed in China; granted as patent CN119128192B (en); status: Active
Also Published As
Publication number | Publication date |
---|---
CN119128192A (en) | 2024-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---
Mao et al. | Document structure analysis algorithms: a literature survey | |
US8196030B1 (en) | System and method for comparing and reviewing documents | |
US8208765B2 (en) | Search and retrieval of documents indexed by optical character recognition | |
Wilkinson et al. | Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections | |
Khurshid et al. | Word spotting in historical printed documents using shape and sequence comparisons | |
EP2166488A2 (en) | Handwritten word spotter using synthesized typed queries | |
US9141853B1 (en) | System and method for extracting information from documents | |
CN112560849B (en) | Neural network algorithm-based grammar segmentation method and system | |
Rane et al. | Chartreader: Automatic parsing of bar-plots | |
Van Phan et al. | A nom historical document recognition system for digital archiving | |
Lehenmeier et al. | Layout detection and table recognition–recent challenges in digitizing historical documents and handwritten tabular data | |
Roy et al. | Date-field retrieval in scene image and video frames using text enhancement and shape coding | |
CN118379754A (en) | Exercise book detection method and system based on cloud computing and artificial intelligence | |
CN110795942A (en) | Keyword determination method and device based on semantic recognition and storage medium | |
CN114817548B (en) | Text classification method, device, equipment and storage medium | |
CN114782965B (en) | Visually rich document information extraction method, system and medium based on layout relevance | |
Garrido-Munoz et al. | Handwritten Text Recognition: A Survey | |
CN119128192B (en) | Content retrieval method combining PDF search mode with OCR recognition | |
JP2022095391A (en) | Information processing equipment and information processing programs | |
Wilkinson et al. | Neural word search in historical manuscript collections | |
CN117540715A (en) | Table identification method and system based on deep learning and computer vision | |
CN117668234A (en) | Text label dividing method, medium and electronic equipment | |
CN117076455A (en) | Intelligent identification-based policy structured storage method, medium and system | |
Saabni et al. | Keywords image retrieval in historical handwritten Arabic documents | |
Kulkarni et al. | Digitization of Physical Notes: A Comprehensive Approach Using OCR, CNN, RNN, and NMF |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |