
CN113448918B - Enterprise scientific research result management method, management platform, equipment and storage medium - Google Patents

Enterprise scientific research result management method, management platform, equipment and storage medium

Info

Publication number
CN113448918B
Authority
CN
China
Prior art keywords
pdf
document
file
enterprise
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111010269.4A
Other languages
Chinese (zh)
Other versions
CN113448918A (en
Inventor
许宁
邓洋
黄文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Fifth Engineering Bureau Co Ltd
Original Assignee
China Construction Fifth Engineering Bureau Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Fifth Engineering Bureau Co Ltd
Priority to CN202111010269.4A
Publication of CN113448918A
Application granted
Publication of CN113448918B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a management method, a management platform, a device and a storage medium for enterprise scientific research results. After a PDF document is imported, it is processed and converted into an XML file so that keywords can be extracted automatically. The contents of the bibliographic fields are extracted from the document content and stored; each bibliographic field content corresponds to a different XML node, and each XML node is anchor-matched to the storage location of its bibliographic field, so that a user can search by bibliographic field information. The keywords extracted from any two PDF documents are converted into word vectors, and the two documents are dynamically associated according to the similarity between the two word vectors. This keyword-based similarity calculation automatically mines the technical associations between PDF documents and dynamically links technical points into a mesh-like network of interconnections.

Description

Enterprise scientific research result management method, management platform, equipment and storage medium
Technical Field
The present invention relates to the field of blockchain technology, and in particular to a method, a management platform, a device, and a computer-readable storage medium for managing enterprise scientific research results.
Background
Innovation is the primary driving force of development. As the position of enterprises as the main body of scientific and technological innovation keeps strengthening, competition among enterprises is fundamentally a competition of scientific and technological strength and innovation capability, and ultimately a competition in the commercialization and industrialization of scientific and technological achievements and in market share. To grow larger, stronger and more specialized, enterprises strive to build and improve their own independent innovation systems, rely on technological innovation and technical progress to accelerate the transformation and upgrading of traditional labor-intensive industries, and seek a fundamental change in their mode of development. Carrying out scientific research and managing its results have therefore become important to improving an enterprise's strength and competitiveness. Scientific research results embody the knowledge of generations of an enterprise's staff and are extremely valuable intellectual resources. How to fully exploit the value of these resources, using existing results to support and guide business work and technical innovation while protecting them from unauthorized dissemination and use, has become a difficult problem for enterprises that manage and commercialize their scientific and technological achievements.
At present, some enterprises manage their scientific research result files in a unified way by building an enterprise scientific research result management platform. However, existing platforms only store the files separately, with no association between them, and cannot dynamically associate technical data.
Disclosure of Invention
The invention provides an enterprise scientific research result management method, a management platform, equipment and a computer readable storage medium, which are used for solving the technical problem that the existing enterprise scientific research result management platform cannot realize dynamic association of technical data.
According to one aspect of the invention, a method for managing enterprise scientific research results is provided, comprising the following steps:
importing an enterprise scientific research result file, wherein the file is in PDF (Portable Document Format);
storing the imported PDF document, performing structuring processing on the PDF document to extract the document content and the logical structure of the document, generating a standardized XML file based on the extracted document content, and extracting bibliographic field information from the document content and storing it in the corresponding bibliographic field storage locations in a database, wherein each bibliographic field content in the document content corresponds to a different XML node, and each XML node is anchor-matched to the corresponding bibliographic field storage location in the database;
automatically extracting keywords from the generated XML file, and storing the extracted keywords in a keyword storage location in the database;
converting the keywords extracted from two PDF documents into two feature vectors respectively, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents when the calculated similarity is greater than a threshold.
Further, the process of converting the keywords extracted from the two PDF documents into two feature vectors, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents when the calculated similarity is greater than the threshold specifically includes the following steps:
converting the keywords extracted from the two PDF documents into two word vectors respectively by using a trained Word2vec model;
calculating the similarity between the two word vectors using the cosine similarity formula, and automatically associating the two PDF documents when the calculated cosine similarity is greater than a threshold.
Further, the method further comprises the following:
capturing web page information from the Internet using web crawler technology, automatically extracting the body text of the web pages, and cleaning and denoising the extracted web page content before importing it into the database.
Further, the method further comprises the following:
encrypting the enterprise scientific research result files stored in the database.
Further, the method further comprises the following:
retrieving enterprise scientific research result files from the database.
In addition, another embodiment of the present invention provides an enterprise scientific research result management platform, including:
a file import module, used for importing enterprise scientific research result files, wherein the files are in PDF (Portable Document Format);
a file processing module, used for storing the imported PDF documents, performing structuring processing on each PDF document to extract the document content and the logical structure of the document, generating a standardized XML file based on the extracted document content, and extracting bibliographic field information from the document content and storing it in the corresponding bibliographic field storage locations in a database, wherein each bibliographic field content in the document content corresponds to a different XML node, and each XML node is anchor-matched to the corresponding bibliographic field storage location in the database;
a keyword extraction module, used for automatically extracting keywords from the generated XML file and storing the extracted keywords in a keyword storage location in the database;
a document association module, used for converting the keywords extracted from two PDF documents into two feature vectors respectively, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents when the calculated similarity is greater than a threshold.
Further, the platform further comprises:
an external information capturing module, used for capturing web page information from the Internet using web crawler technology, automatically extracting the body text of the web pages, and cleaning and denoising the extracted web page content before importing it into the database.
Further, the platform further comprises:
an encryption module, used for encrypting the enterprise scientific research result files stored in the database.
In addition, another embodiment of the present invention further provides an apparatus, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the steps of the method described above by calling the computer program stored in the memory.
In addition, another embodiment of the present invention further provides a computer-readable storage medium for storing a computer program for managing enterprise research results, where the computer program performs the steps of the method described above when running on a computer.
The invention has the following effects:
In the enterprise scientific research result management method of the invention, after an enterprise scientific research result file in PDF format is imported, the PDF document is first processed and converted into an XML file to facilitate the subsequent automatic extraction of keywords. The contents of the bibliographic fields are extracted from the document content and stored; each bibliographic field content corresponds to a different XML node, and each XML node is anchor-matched to the storage location of its bibliographic field, so that a user can search by bibliographic field information. Keywords are then automatically extracted from the XML file and stored, the keywords extracted from any two PDF documents are converted into word vectors, and the two documents are dynamically associated according to the similarity between the two word vectors. This keyword-based similarity calculation automatically discovers the technical associations between PDF documents and dynamically links technical points into a mesh-like network of interconnections.
The enterprise scientific research result management platform, the device and the computer-readable storage medium of the invention have the same advantages.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The present invention is described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a method for managing enterprise scientific research results according to a preferred embodiment of the present invention.
FIG. 2 is a sub-flowchart of step S2 in FIG. 1.
FIG. 3 is a sub-flowchart of step S3 in FIG. 1.
FIG. 4 is a sub-flowchart of step S4 in FIG. 1.
FIG. 5 is a schematic block diagram of an enterprise scientific research result management platform according to another embodiment of the present invention.
Detailed Description
The embodiments of the invention will be described in detail below with reference to the accompanying drawings, but the invention can be embodied in many different forms, which are defined and covered by the following description.
As shown in fig. 1, a preferred embodiment of the present invention provides a method for managing scientific achievements of an enterprise, which includes the following steps:
Step S1: importing an enterprise scientific research result file, wherein the file is in PDF (Portable Document Format);
Step S2: storing the imported PDF document, performing structuring processing on the PDF document to extract the document content and the logical structure of the document, generating a standardized XML file based on the extracted document content, and extracting bibliographic field information from the document content and storing it in the corresponding bibliographic field storage locations in a database, wherein each bibliographic field content in the document content corresponds to a different XML node, and each XML node is anchor-matched to the corresponding bibliographic field storage location in the database;
Step S3: automatically extracting keywords from the generated XML file, and storing the extracted keywords in a keyword storage location in the database;
Step S4: converting the keywords extracted from two PDF documents into two feature vectors respectively, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents when the calculated similarity is greater than a threshold.
It can be understood that, in the method of this embodiment, after an enterprise scientific research result file in PDF format is imported, the PDF document is first processed and converted into an XML file to facilitate the subsequent automatic extraction of keywords. The contents of the bibliographic fields are extracted from the document content and stored; each bibliographic field content corresponds to a different XML node, and each XML node is anchor-matched to the storage location of its bibliographic field, so that a user can search by bibliographic field information. Keywords are then automatically extracted from the XML file and stored, the keywords extracted from any two PDF documents are converted into word vectors, and the two documents are dynamically associated according to the similarity between the two word vectors. This keyword-based similarity calculation automatically discovers the technical associations between PDF documents and dynamically links technical points into a mesh-like network of interconnections.
It is to be understood that, in the step S1, in order to facilitate uniform management of the scientific research results of the enterprise, the imported file format is required to be a PDF document. The types of the scientific research results of the enterprises comprise computer software development and application, BIM awards, patents, construction methods, standard specifications, atlas, monograph, treatises, scientific awards, topics, demonstration projects and the like.
It is understood that, in step S2, deep structuring processing is performed on the PDF document to extract the document content and, at the same time, the logical structure of the document, such as the position information of each paragraph and title. This enables automatic labeling of document metadata and automatic parsing and identification of the logical structure of the body text, such as chapters, sections, titles, figures and formulas. In addition, if the imported PDF document contains a table of contents, the table-of-contents headings can be extracted, an anchor can be added at the corresponding position in the body text, and each added anchor can be matched with its extracted heading, so as to realize bidirectional indexing between the table of contents and the body text.
It is understood that, as shown in fig. 2, the process of extracting document contents from the PDF document and generating a standardized XML file in step S2 specifically includes the following steps:
Step S21: Recognize each character in the PDF document using OCR (Optical Character Recognition) technology, and represent the recognition result of each character as a character block. Each character block contains the position information of the character, the format information of the character, and the character itself; the character blocks together form an XML data set. The characters include Chinese characters, English letters, digits, punctuation marks and the like, where the punctuation marks are mainly those that help break text into sentences, such as commas, periods and semicolons. The position information of a character consists of row position information and column position information, from which the position of each character in the whole PDF document can be located accurately.
Step S22: Merge the character blocks based on the position information of the characters to obtain a plurality of character block combinations. Specifically, the characters between two adjacent punctuation marks can be merged according to the position information of those punctuation marks, so that single characters are spliced into a sentence and the XML data set is converted from a single-character structure into a sentence structure carrying semantic information. For example, suppose the original PDF document contains the sentence "I am Chinese, and I am proud of it." After OCR recognition, the sentence is represented as a sequence of single characters. Using the position information of the adjacent punctuation marks, the single characters before the comma can be merged into the clause "I am Chinese", and the single characters between the comma and the period can be merged into the clause "I am proud of it". It is understood that when no punctuation mark precedes a given punctuation mark, all character blocks located before that punctuation mark are merged into one character block combination.
Step S23: Extract phrases from each character block combination using a preset word segmentation model to generate phrase blocks, where each phrase block contains at least two character blocks. Specifically, a dictionary-based word segmentation model or a statistics-based machine learning word segmentation model may be used to extract phrases from each character block combination in the XML data set, that is, to extract phrases from a sentence. For example, the phrase "Chinese" is extracted from the character block combination "I am Chinese". Such a phrase reflects the characteristics of the sentence better and is more likely to be a keyword, whereas a single character such as "I" or "am" is unlikely to be one. Extracting phrases from sentences with the word segmentation model reduces the number of characters to be processed in the subsequent automatic keyword extraction and thus speeds it up.
Step S24: Obtain the position information of each phrase block from the position information of its first and last character blocks. Specifically, the row and column position information of the first character block is used as the start position of the phrase block, and the position information of the last character block is used as its end position, which gives the position of the phrase block in the whole PDF document.
Step S25: Perform verification processing on the phrase blocks to generate the standardized XML file. OCR is prone to recognition errors, for example recognizing one character as several overlapping characters. The verification process corrects such overlapping characters into a single character and corrects the position information of the phrase block accordingly.
It can be understood that, in step S2, the information for each bibliographic field is extracted from the document content and stored in the corresponding bibliographic field storage location in the database. The bibliographic fields include at least one of author, unit, journal, subject, organization, fund, subject, area, abstract and custom field, and the content of the custom field can be set according to actual search requirements. Each bibliographic field content in the document content corresponds to a different XML node, and each XML node is anchor-matched to the corresponding bibliographic field storage location in the database, which realizes bidirectional indexing between the bibliographic fields and the XML file. The generated XML file is stored as long text in the corresponding storage location of the database.
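One possible reading of this anchor matching is sketched below in Python. The patent does not specify the XML schema or the database layout, so the element names (`document`, `bibliography`, `field`), the `anchor` attribute, the anchor naming scheme and the dictionary standing in for a database row are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_xml_with_anchors(doc_id, fields):
    """Create one XML node per bibliographic field and record its anchor.

    Returns the XML root, a dict standing in for the database row, and a
    mapping from anchor to column name (the "anchor matching").
    """
    root = ET.Element("document", {"id": doc_id})
    bibliography = ET.SubElement(root, "bibliography")
    db_row = {"doc_id": doc_id}
    anchor_to_column = {}
    for name, value in fields.items():
        anchor = f"{doc_id}-{name}"  # hypothetical anchor naming scheme
        node = ET.SubElement(bibliography, "field", {"name": name, "anchor": anchor})
        node.text = value
        db_row[name] = value
        anchor_to_column[anchor] = name
    return root, db_row, anchor_to_column


def node_for_column(root, anchor_to_column, column):
    """Database column -> XML node (one direction of the bidirectional index)."""
    column_to_anchor = {c: a for a, c in anchor_to_column.items()}
    anchor = column_to_anchor[column]
    return root.find(f".//field[@anchor='{anchor}']")


def column_for_node(node, anchor_to_column):
    """XML node -> database column (the other direction)."""
    return anchor_to_column[node.get("anchor")]


root, row, anchors = build_xml_with_anchors(
    "D1", {"author": "Zhang San", "unit": "Example Institute"}
)
author_node = node_for_column(root, anchors, "author")
```

Because the anchor is stored on the XML node and in the mapping, a lookup can start from either side, which is the bidirectional property the paragraph above describes.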
It can be understood that, in step S3, keywords could be extracted automatically from the XML file using a trained machine learning model, for example a decision tree model, a naive Bayes model, a support vector machine model or a conditional random field model. However, extracting keywords with such a model is essentially a word classification task: it requires a large training corpus, the model is difficult to train, and overfitting is likely, so this approach is not well suited to an enterprise scientific research result management platform. Preferably, as shown in fig. 3, the process of automatically extracting keywords from the XML file in step S3 specifically includes:
Step S31: Determine the number of paragraphs in the PDF document and the number and position range of each paragraph according to the logical structure of the document, and determine the paragraph number of each phrase block in the PDF document based on the position information of the phrase block and the number and position range of each paragraph. Specifically, the logical structure of the document can be extracted using OCR technology to determine the number of paragraphs in the PDF document and the start and end positions of each paragraph. If the position of a phrase block lies between the start and end positions of a paragraph, the phrase block is determined to belong to that paragraph, which gives the paragraph number of the phrase block in the whole PDF document.
Step S32: Obtain the number of sentences in the paragraph and the number and position range of each sentence based on the position information of the punctuation character blocks in the paragraph, and determine the sentence number of each phrase block within the paragraph based on the position information of the phrase block and the number and position range of each sentence. Specifically, after the paragraph number of the phrase block has been determined in step S31, the paragraph is split into sentences based on the position information of its punctuation character blocks, which gives the number of sentences in the paragraph and the number and position range of each sentence. If the position of the phrase block lies within the position range of a sentence, the phrase block is determined to belong to that sentence, which gives the sentence number of the phrase block within the paragraph.
Step S33: Count the number of occurrences of each phrase block, calculate a position weight for each occurrence based on the paragraph number and sentence number of that occurrence, and sum the position weights of all occurrences to obtain the total position weight of the phrase block.
Step S34: Sort the phrase blocks by total position weight from high to low, select the first N phrase blocks, and output them as keywords.
Specifically, through research, the inventors of the present application found that, in an enterprise scientific research result document, the first paragraph is an overall summary of the full text and carries its core elements, while the last paragraph is a conclusion and summary of the full text. Within a single paragraph, the first sentence introduces the subject of the paragraph, and the last sentence summarizes its content. Keywords are therefore more likely to appear in the first and last paragraphs and in the first and last sentences, so the position information of a phrase has a great influence on keyword extraction. The position of a phrase within a sentence, by contrast, is difficult to extract as a feature with a generalizing effect. Therefore, the position features and statistical features of phrases in the document are used as the evaluation indexes for keyword extraction, which allows keywords to be extracted more accurately.
The total position weight of each phrase block is calculated by the following formula:
[Formula image 1: total position weight of a phrase block; the image is not reproduced in this text.]
In this formula, the total position weight of the phrase block is obtained from n, the number of times the phrase block appears, together with the weight of the paragraph and the weight of the sentence in which the i-th appearance of the phrase block occurs. The paragraph weight and sentence weight are given by:
[Formula image 2: paragraph weight and sentence weight; the image is not reproduced in this text.]
Here x is the paragraph number of the phrase block, y is the sentence number of the phrase block, m is the number of paragraphs in the PDF document, k is the number of sentences in the paragraph, and alpha and beta are position parameters that can be obtained through training and optimization.
With the above formula, whether a phrase block has the characteristics of a keyword can be evaluated comprehensively from how often it recurs and the position weight of each occurrence, which greatly improves the precision of automatic keyword extraction. Combined with the phrase extraction by the word segmentation model in step S2, the method balances extraction speed and extraction precision.
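The ranking in steps S33 and S34 can be sketched in Python, with one important caveat: the patent's weight formulas are images that are not reproduced in this text, so the weight functions below are stand-ins invented for illustration, not the patent's formulas. The sketch assumes that first and last paragraphs get weight alpha, first and last sentences get weight beta, everything else gets 1.0, and that the per-occurrence weight is the product of the two. The default values of `alpha` and `beta` are likewise arbitrary.

```python
from collections import defaultdict


def paragraph_weight(x, m, alpha=2.0):
    """ASSUMED form: boost the first and last of m paragraphs by alpha."""
    return alpha if x in (1, m) else 1.0


def sentence_weight(y, k, beta=1.5):
    """ASSUMED form: boost the first and last of k sentences by beta."""
    return beta if y in (1, k) else 1.0


def top_keywords(occurrences, m, sentences_per_paragraph, top_n):
    """Steps S33 and S34: sum per-occurrence weights, then keep the top N.

    occurrences is a list of (phrase, paragraph_no, sentence_no) tuples.
    Combining the two weights by multiplication is an assumption.
    """
    totals = defaultdict(float)
    for phrase, x, y in occurrences:
        k = sentences_per_paragraph[x]
        totals[phrase] += paragraph_weight(x, m) * sentence_weight(y, k)
    # Sort by total weight descending; break ties alphabetically for stability.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:top_n]]


# Illustrative document: 3 paragraphs with 2, 3 and 2 sentences.
occurrences = [
    ("formwork", 1, 1),
    ("formwork", 3, 2),
    ("concrete", 2, 2),
    ("concrete", 2, 2),
    ("crane", 2, 1),
]
keywords = top_keywords(occurrences, m=3, sentences_per_paragraph={1: 2, 2: 3, 3: 2}, top_n=2)
```

Under these assumed weights, a phrase that appears once in the opening sentence and once in the closing sentence of the document outranks a phrase that appears twice in the middle, which matches the intuition given for using position together with frequency.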
It is understood that, as shown in fig. 4, in the preferred embodiment of the present invention, the step S4 specifically includes the following steps:
step S41: converting the keywords extracted from each of the two PDF documents into a word vector by means of a trained neural network model, giving two word vectors;
step S42: calculating the similarity between the two word vectors by the cosine formula, and automatically associating the two PDF documents when the calculated cosine similarity is greater than a threshold value.
Specifically, a trained Word2vec model is used to convert the at least one keyword extracted from each of the two PDF documents into a word vector, so that each word vector corresponds to one PDF document. Because the first N phrase blocks are selected in step S34, the feature vectors of the two PDF documents have the same dimension, and because the feature vectors output by the Word2vec model are normalized, the cosine formula can be applied to the two feature vectors directly. The cosine similarity between the two word vectors is then calculated as follows:
$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{j} A_j B_j}{\sqrt{\sum_{j} A_j^{2}}\,\sqrt{\sum_{j} B_j^{2}}}$$
where A is the word vector corresponding to the first PDF document and B is the word vector corresponding to the second PDF document. When the calculated cosine similarity is greater than a preset threshold value, the two word vectors are judged to be highly similar, which means that the keywords of the two PDF documents are similar, and the two PDF documents are automatically associated, so that technically similar associated documents can be found together in subsequent retrieval of scientific research result documents.
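The association step can be illustrated with a minimal Python sketch. It assumes that each document has already been reduced to one numeric vector of equal dimension; how the Word2vec output for the N keywords is turned into that single vector is not specified here, so the vectors below are made-up inputs. The names `cosine_similarity`, `associate_documents` and the threshold value 0.8 are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def associate_documents(doc_vectors, threshold=0.8):
    """Return every pair of document ids whose similarity exceeds the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(doc_vectors.items()), 2):
        if cosine_similarity(vec_a, vec_b) > threshold:
            pairs.append((id_a, id_b))
    return pairs


doc_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [1.0, 0.1],
    "doc_c": [0.0, 1.0],
}
associated = associate_documents(doc_vectors)
```

Comparing every pair grows quadratically with the number of documents, so a real platform would likely restrict the comparison to newly imported documents or use an index; that choice is outside what the description states.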
It is understood that, in another embodiment of the present invention, the method for managing the scientific results of the enterprise further includes the following steps after step S4:
step S5: carrying out enterprise scientific research result file retrieval in the database.
Specifically, the user can search files by any one or combination of authors, units, periodicals, subjects, organizations, funds, subjects, regions, abstracts, custom fields and keywords, and can automatically associate documents with similar technical points to form a knowledge association network.
It is understood that, in another embodiment of the present invention, the method for managing the scientific results of the enterprise further includes the following steps after step S4:
step S6: capturing webpage information from the Internet using web crawler technology, automatically extracting the text content of the webpages, and cleaning and denoising the extracted webpage content before importing it into the database.
It can be understood that Internet information is captured from portals or industry sites by Internet acquisition technology and aggregated on the platform, and topic-specific information is customized as required and read to help keep track of industry developments. Key industry information can also be monitored in real time and information bulletins generated periodically to assist decision making, so as to maximize the value of Internet information. Specifically, a web spider program uses multithreaded concurrency and, on the basis of HTTP (Hypertext Transfer Protocol), periodically searches for information by monitoring the entry addresses of websites, captures webpages, automatically extracts their text content, and cleans and denoises the webpage content to improve the accuracy of content analysis and filtering; the content is then imported into the platform.
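The text extraction and cleaning part of this step might look like the following Python sketch, which uses only the standard library `html.parser` module. Fetching pages, scheduling and multithreading are left out. The set of tags treated as noise (`NOISE_TAGS`) and the function name `extract_text` are assumptions made for illustration, not part of the described platform.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping tags that usually carry noise."""

    # Assumed noise containers; a real crawler would tune this per site.
    NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._noise_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        if self._noise_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the page text with noise tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


page = (
    "<html><head><style>p{}</style><script>var a=1;</script></head>"
    "<body><nav>Home</nav><p>Research  results</p>\n<p>released</p></body></html>"
)
clean = extract_text(page)
```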
It is understood that, in another embodiment of the present invention, the method for managing the scientific results of the enterprise further includes the following steps after step S4:
step S7: encrypting the enterprise scientific research result files stored in the database.
It can be understood that there are three encryption modes: server encryption, certificate encryption, and dual server-and-certificate encryption. Server encryption uses a DRM server: when a user opens a file, the reader sends a permission verification request to the DRM server, and the file can be opened only after verification passes, so files are read in a networked state. Server encryption allows a time limit and a limit on the number of openings to be imposed; the file can be copied and can be read on several computers at the same time without registering the computers. Certificate encryption uses a local certificate: when a user opens a file, the reader verifies the permission status against the local certificate, and the file can be opened only after verification passes, so files can be read offline. Certificate encryption allows file copying to be restricted, and the computer must be registered before reading. Dual server-and-certificate encryption applies both modes at the same time, so file copying can be restricted, a time limit and a limit on the number of openings can be imposed, files can be read only when networked, and the computer must be registered before reading. Encrypting the stored files effectively promotes the sharing of scientific and technological achievements while the data remain relatively secure, forms a relatively complete organizational system for managing scientific and technological achievements, and supports the effective development of that management work.
This not only facilitates the new work of the enterprise's scientific and technological researchers and reduces unnecessary repeated work, but also introduces new ideas for later research and development, shows the role of scientific and technological achievements in all respects, and lets past achievements better serve as reference and guidance for future research and development. At the same time, copyright protection solves the problem of controlling and protecting digital intellectual property within the enterprise.
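The differences between the three encryption modes can be summarized as a small policy table. The Python sketch below records only the properties stated in the description above; the class and field names are hypothetical, and the `can_open` check covers only the preconditions (network and computer registration), not the actual permission verification against a DRM server or a certificate.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SERVER = "server"
    CERTIFICATE = "certificate"
    DUAL = "server and certificate"


@dataclass(frozen=True)
class Policy:
    needs_network: bool              # file is read in a networked state
    needs_registered_computer: bool  # computer must be registered first
    limits_time_and_openings: bool   # time limit and opening-count limit stated
    restricts_copying: bool          # file copy limitation stated


# Properties as stated in the description; unstated ones are left False.
POLICIES = {
    Mode.SERVER: Policy(True, False, True, False),
    Mode.CERTIFICATE: Policy(False, True, False, True),
    Mode.DUAL: Policy(True, True, True, True),
}


def can_open(mode, online, computer_registered):
    """Check only the preconditions for opening a file under the given mode."""
    policy = POLICIES[mode]
    if policy.needs_network and not online:
        return False
    if policy.needs_registered_computer and not computer_registered:
        return False
    return True
```

For example, `can_open(Mode.CERTIFICATE, online=False, computer_registered=True)` is true, which matches the statement that certificate-encrypted files can be read offline on a registered computer.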
It can be understood that, as shown in fig. 5, another embodiment of the present invention further provides an enterprise research result management platform, preferably using the above-mentioned enterprise research result management method, where the platform specifically includes:
the file import module is used for importing the enterprise scientific research result files, wherein the file format is PDF (Portable Document Format);
the file processing module is used for storing imported PDF files, carrying out structuralization processing on the PDF files to extract file contents and file logic structures, generating standardized XML files based on the extracted file contents, extracting bibliographic field information from the file contents and correspondingly storing the bibliographic field information to bibliographic field storage positions in a database, wherein each bibliographic field content in the file contents corresponds to different XML nodes, and each XML node is in anchor point matching with the bibliographic field storage position of the database;
the keyword extraction module is used for automatically extracting keywords from the generated XML file and storing the extracted keywords to a keyword storage position of a database;
and the document association module is used for respectively converting the keywords extracted from the two PDF documents into two feature vectors, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents of which the similarity calculation result is greater than a threshold value.
It can be understood that, after importing a scientific research result file in PDF format, the enterprise scientific research result management platform of this embodiment processes the PDF document and converts it into an XML file to facilitate subsequent automatic keyword extraction, and also stores the bibliographic field content extracted from the document content. Each bibliographic field content corresponds to a different XML node, and each XML node is anchored to the storage location of its bibliographic field, so that the user can retrieve content by bibliographic field information. Keywords are then automatically extracted from the XML file and stored, the keywords extracted from any two PDF documents are converted into word vectors, and any two PDF documents are dynamically associated by calculating the similarity between the two word vectors. Because the similarity is calculated from keywords, technical associations between PDF documents are found automatically and technical points are dynamically linked, forming a mesh of interconnections.
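The anchoring of XML nodes to bibliographic field storage locations can be pictured with a short Python sketch that uses the standard `xml.etree.ElementTree` and `sqlite3` modules. The node names, the `bibliography` table, its columns and the `ANCHORS` mapping are all hypothetical; the description does not specify the XML schema or the database layout.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical anchor table: XML node tag -> database column.
ANCHORS = {"title": "title", "author": "author", "unit": "unit"}


def open_database():
    """Create an in-memory database with one column per bibliographic field."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bibliography (title TEXT, author TEXT, unit TEXT)")
    return conn


def store_bibliographic_fields(xml_text, conn):
    """Read each anchored XML node and store its text in the matching column."""
    root = ET.fromstring(xml_text)
    row = {
        column: (root.findtext(f".//{tag}") or "").strip()
        for tag, column in ANCHORS.items()
    }
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO bibliography ({columns}) VALUES ({placeholders})",
        list(row.values()),
    )
    return row


xml_text = (
    "<document><title>Bridge monitoring</title><author>Author A</author>"
    "<unit>Unit B</unit><body>full text</body></document>"
)
conn = open_database()
stored = store_bibliographic_fields(xml_text, conn)
```

Once the fields sit in their own columns, a retrieval by author or unit becomes an ordinary query such as `SELECT title FROM bibliography WHERE author = ?`.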
In addition, the enterprise scientific research result management platform further comprises:
the retrieval module is used for the user to carry out enterprise scientific research result file retrieval in the database;
the external information capturing module is used for capturing webpage information from the Internet by adopting a web crawler technology, automatically extracting the text content of the webpage, cleaning and eliminating noise of the extracted webpage content and then importing the webpage content into a database;
and the encryption module is used for encrypting the enterprise scientific research result files stored in the database.
Wherein the document processing module comprises:
the character recognition unit is used for recognizing each character in the PDF document by OCR technology. The characters include Chinese characters, English letters, digits, punctuation marks and the like, where the punctuation marks are commas, periods, semicolons and other marks that help with sentence breaking. The recognition result of each character is represented as a character block, the content of each character block includes the character itself, the position information of the character and the format information of the character, and the character blocks together form an XML data set. The position information of a character comprises row position information and column position information, from which the position of each character in the whole PDF document can be located accurately.
the character block combination construction unit is used for combining a plurality of character blocks based on the position information of the characters to obtain a plurality of character block combinations. Specifically, the characters between two punctuation marks can be combined according to the position information of adjacent punctuation marks, so that single characters are spliced into a sentence and the XML data set is converted from a single-character structure into a sentence structure carrying semantic information. For example, suppose the original PDF document contains the statement "Wherever I am, I am a Chinese, and I am proud of this." After OCR recognition the statement is represented as a number of single characters. Using the position information of adjacent punctuation marks, the single characters between one pair of adjacent punctuation marks can be merged into the sentence "I am a Chinese", and the single characters between the next pair of adjacent punctuation marks can be merged into the sentence "I am proud of this". It is understood that when there is no other punctuation mark before a given punctuation mark, all character blocks located before that punctuation mark are merged into one character block combination.
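A minimal Python sketch of this merging step is given below. The `CharBlock` structure is a simplified stand-in for a character block (format information is omitted), and the set of sentence-breaking marks is an assumption that covers commas, periods and semicolons in both half-width and full-width form.

```python
from dataclasses import dataclass


@dataclass
class CharBlock:
    char: str
    row: int  # row position of the character in the document
    col: int  # column position of the character in the row


# Assumed sentence-breaking marks: commas, periods and semicolons.
SENTENCE_BREAKS = set(",.;，。；")


def combine_character_blocks(blocks):
    """Group the character blocks lying between adjacent punctuation marks."""
    ordered = sorted(blocks, key=lambda block: (block.row, block.col))
    combinations, current = [], []
    for block in ordered:
        if block.char in SENTENCE_BREAKS:
            if current:
                combinations.append(current)
            current = []
        else:
            current.append(block)
    if current:  # trailing characters without a closing punctuation mark
        combinations.append(current)
    return combinations


blocks = [CharBlock(char, 0, col) for col, char in enumerate("我是中国人，我为此自豪。")]
sentences = ["".join(b.char for b in group) for group in combine_character_blocks(blocks)]
```

Characters that precede the first punctuation mark fall into the first group automatically, which corresponds to the rule stated at the end of the paragraph above.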
the phrase construction unit is used for extracting phrases from each character block combination with a preset word segmentation model to generate phrase blocks, each phrase block comprising at least two character blocks. Specifically, a dictionary-based word segmentation model or a statistics-based machine-learning word segmentation model can be used to extract phrases from each character block combination in the XML data set, that is, from each sentence. For example, the phrase "Chinese" is extracted from the character block combination "I am a Chinese"; the phrase reflects the characteristics of the sentence better and is more likely to be a keyword, whereas a single character such as "I" or "am" is less likely to be a keyword. Extracting phrases from sentences with the word segmentation model reduces the number of characters to be processed in the subsequent automatic keyword extraction and so increases its speed.
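As one possible dictionary-based word segmentation model, the Python sketch below uses forward maximum matching, a common textbook technique; the description does not say which dictionary-based algorithm is used, so this choice, the tiny dictionary and the `max_len` limit are assumptions. Tokens of a single character are dropped afterwards, in line with the requirement that a phrase block comprise at least two character blocks.

```python
def forward_maximum_matching(sentence, dictionary, max_len=4):
    """Segment a sentence by always taking the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(sentence):
        longest = min(max_len, len(sentence) - i)
        for size in range(longest, 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def extract_phrases(sentence, dictionary, max_len=4):
    """Keep only tokens made of at least two characters."""
    tokens = forward_maximum_matching(sentence, dictionary, max_len)
    return [token for token in tokens if len(token) >= 2]


dictionary = {"中国", "中国人"}
tokens = forward_maximum_matching("我是一个中国人", dictionary)
phrases = extract_phrases("我是一个中国人", dictionary)
```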
And the phrase position calculating unit is used for obtaining the position information of the phrase block according to the position information of the first character block and the last character block in the phrase block. Specifically, the line position information and the column position information of the first character block are used as the start position information of the phrase block, and the position information of the last character block is used as the end position information of the phrase block, so that the position of the phrase block in the whole PDF document can be obtained.
the checking unit is used for checking the phrase blocks to generate a standardized XML file. Because OCR recognition is prone to errors, for example producing several overlapping characters, the checking process corrects the overlapping characters into one character and corrects the position information of the phrase block accordingly.
The keyword extraction module comprises:
and the paragraph number calculating unit is used for determining the number of paragraphs included in the PDF document and the number and the position range of each paragraph according to the document logic structure, and determining the paragraph number of each phrase block in the PDF document based on the position information of each phrase block and the number and the position range of each paragraph. Specifically, a document logic structure can be extracted by using an OCR technology, so as to determine the number of paragraphs, and the start position and the end position of each paragraph included in the PDF document, and if the position of a phrase block is located between the start position and the end position of a certain paragraph, it is determined that the phrase block belongs to the paragraph, so as to determine the paragraph number of the phrase block in the whole PDF document.
And the sentence number calculation unit is used for obtaining the number of sentences contained in the paragraph and the number and the position range of each sentence based on the position information of the punctuation mark character block contained in the paragraph, and determining the sentence number of the phrase block in the paragraph based on the position information of each phrase block and the number and the position range of each sentence. Specifically, the paragraph number of the phrase block may be determined based on step S31, and then the sentence breaking processing is performed on the paragraph based on the position information of the punctuation character block included in the paragraph, so as to obtain the number of the sentences included in the paragraph and the number and position range of each sentence, and if the position information of the phrase block is located in the position range of a sentence, it is determined that the phrase block belongs to the sentence, so as to obtain the sentence number of the phrase block in the paragraph.
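The paragraph and sentence numbering described above amounts to locating a position inside a list of position ranges. The Python sketch below represents a position as a `(row, column)` tuple, uses the start position of a phrase block as its position, and relies on `bisect` from the standard library; these representation choices and the function names are assumptions made for illustration.

```python
from bisect import bisect_right


def locate(position, ranges):
    """Return the 1-based number of the range containing position, or None.

    ranges must be sorted and non-overlapping: [(start, end), ...].
    """
    starts = [start for start, _ in ranges]
    index = bisect_right(starts, position) - 1
    if index >= 0 and ranges[index][0] <= position <= ranges[index][1]:
        return index + 1
    return None


def sentence_ranges(paragraph_range, break_positions):
    """Split a paragraph range into sentence ranges at punctuation positions."""
    start, end = paragraph_range
    ranges, cursor = [], start
    for pos in sorted(p for p in break_positions if start <= p <= end):
        ranges.append((cursor, pos))
        cursor = (pos[0], pos[1] + 1)  # next sentence starts after the mark
    if cursor <= end:  # text after the last punctuation mark
        ranges.append((cursor, end))
    return ranges


paragraphs = [((0, 0), (1, 9)), ((2, 0), (4, 9))]
phrase_position = (3, 4)
paragraph_number = locate(phrase_position, paragraphs)
sentences = sentence_ranges(paragraphs[paragraph_number - 1], [(2, 7), (4, 9)])
sentence_number = locate(phrase_position, sentences)
```

Here `len(paragraphs)` plays the role of m and `len(sentences)` the role of k in the position weight calculation.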
And the position weight calculation unit is used for counting the occurrence frequency of each phrase block, calculating the position weight based on the paragraph number and the sentence number where the phrase block appears each time, and summing the position weights obtained by multiple times of calculation to obtain the total position weight of each phrase block.
the keyword screening unit is used for arranging the plurality of phrase blocks in descending order of total position weight, selecting the first N phrase blocks and outputting them as the keywords.
The total position weight of each phrase block is calculated by the following formula:
[Formula image: total position weight of a phrase block; the image is not reproduced in this text]
In this formula, the total position weight of the phrase block is obtained from n, the number of times the phrase block appears, together with the paragraph weight and the sentence weight of the i-th appearance of the phrase block. The paragraph weight and the sentence weight are given by:
[Formula image: paragraph weight and sentence weight; the image is not reproduced in this text]
Here x is the paragraph number of the phrase block, y is the sentence number of the phrase block, m is the number of paragraphs in the PDF document, k is the number of sentences in the paragraph, and α and β are position parameters that can be obtained through training optimization.
The document association module comprises:
the word vector conversion unit is used for converting the keywords extracted from the two PDF documents into two word vectors by adopting a trained neural network model;
and the automatic association unit is used for calculating the similarity between the two word vectors by the cosine formula, and automatically associating the two PDF documents when the calculated cosine similarity is greater than a threshold value.
Specifically, a trained Word2vec model is used to convert the at least one keyword extracted from each of the two PDF documents into a word vector, so that each word vector corresponds to one PDF document. Because the first N phrase blocks are selected in step S34, the feature vectors of the two PDF documents have the same dimension, and because the feature vectors output by the Word2vec model are normalized, the cosine formula can be applied to the two feature vectors directly. The cosine similarity between the two word vectors is then calculated as follows:
$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{j} A_j B_j}{\sqrt{\sum_{j} A_j^{2}}\,\sqrt{\sum_{j} B_j^{2}}}$$
where A is the word vector corresponding to the first PDF document and B is the word vector corresponding to the second PDF document. When the calculated cosine similarity is greater than a preset threshold value, the two word vectors are judged to be highly similar, which means that the keywords of the two PDF documents are similar, and the two PDF documents are automatically associated, so that technically similar associated documents can be found together in subsequent retrieval of scientific research result documents.
It can be understood that each module and unit in the platform of this embodiment correspond to each step in the foregoing method embodiment, and therefore detailed working processes and working principles of each module are not described herein again, and reference may be made to the foregoing method embodiment.
In addition, another embodiment of the present invention further provides an apparatus, which includes a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the steps of the method described above by calling the computer program stored in the memory.
In addition, another embodiment of the present invention further provides a computer-readable storage medium for storing a computer program for managing enterprise research results, where the computer program performs the steps of the method described above when running on a computer.
Typical forms of computer-readable storage media include: floppy disks, flexible disks, hard disks, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read. The instructions may further be transmitted or received over a transmission medium. The term transmission medium may include any tangible or intangible medium that is operable to store, encode, or carry instructions for execution by the machine, and includes digital or analog communication signals or other intangible media that facilitate communication of the instructions. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise a bus for transmitting a computer data signal.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for managing enterprise scientific research results, characterized by comprising the following steps:
importing an enterprise scientific research result file, wherein the file format is PDF (Portable Document Format);
storing an imported PDF document, performing structuralization processing on the PDF document to extract document content and a document logic structure, generating a standardized XML file based on the extracted document content, extracting bibliographic field information from the document content and correspondingly storing the bibliographic field information to a bibliographic field storage position in a database, wherein each bibliographic field content in the document content corresponds to different XML nodes, and each XML node is respectively subjected to anchor point matching with the bibliographic field storage position of the database;
automatically extracting keywords from the generated XML file, and storing the extracted keywords to a keyword storage position of a database;
respectively converting keywords extracted from the two PDF documents into two characteristic vectors, calculating the similarity of the two PDF documents based on the two characteristic vectors, and associating the two PDF documents of which the similarity calculation result is greater than a threshold value;
the process of extracting document contents from a PDF document and generating a standardized XML file specifically includes the following:
recognizing each character from the PDF document by adopting an OCR technology, and expressing the recognition result of each character as a character block, wherein the content of each character block comprises the position information of the character, the format information of the character and the character, and a plurality of character blocks form an XML data set;
combining a plurality of character blocks based on the position information of the characters to obtain a plurality of character block combinations;
extracting a phrase from each character block combination by adopting a preset word segmentation model to generate a phrase block, wherein the phrase block comprises at least two character blocks;
obtaining the position information of the phrase block according to the position information of the first character block and the last character block in the phrase block;
carrying out verification processing on the phrase block to generate a standardized XML file;
the process of automatically extracting keywords from an XML file includes the following:
determining the number of paragraphs included in the PDF document and the number and position range of each paragraph according to the logical structure of the document, and determining the paragraph number of each phrase block in the PDF document based on the position information of each phrase block and the number and position range of each paragraph;
obtaining the number of sentences and the number and position range of each sentence contained in the paragraph based on the position information of the punctuation mark character block contained in the paragraph, and determining the sentence number of the phrase block in the paragraph based on the position information of each phrase block and the number and position range of each sentence;
counting the occurrence frequency of each phrase block, calculating the position weight based on the paragraph number and the sentence number where the phrase block appears each time, and summing the position weights obtained by multiple times of calculation to obtain the total position weight of each phrase block;
and sequentially arranging the plurality of phrase blocks according to the sequence of the total position weight from high to low, screening out the first N phrase blocks, and outputting the phrase blocks as keywords.
2. The method for managing enterprise achievements in scientific research according to claim 1, wherein the steps of converting the keywords extracted from the two PDF documents into two feature vectors, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents with the similarity calculation result larger than a threshold value specifically include the following steps:
respectively converting the key words extracted from the two PDF documents into two Word vectors by adopting a trained Word2vec model;
and calculating the similarity between the two word vectors by adopting a cosine distance calculation formula, and automatically associating the two PDF documents when the calculated cosine distance is greater than a threshold value.
3. The method for managing enterprise scientific research results according to claim 1, further comprising:
capturing webpage information from the Internet using web crawler technology, automatically extracting the text content of the webpages, and cleaning and denoising the extracted webpage content before importing it into the database.
4. The method for managing enterprise scientific research results according to claim 1, further comprising:
encrypting the enterprise scientific research result files stored in the database.
5. The method for managing enterprise scientific research results according to claim 1, further comprising:
carrying out enterprise scientific research result file retrieval in the database.
6. An enterprise research result management platform adopting the enterprise research result management method according to any one of claims 1 to 5, comprising:
the file import module is used for importing the enterprise scientific research result files, wherein the file format is PDF (Portable Document Format);
the file processing module is used for storing imported PDF files, carrying out structuralization processing on the PDF files to extract file contents and file logic structures, generating standardized XML files based on the extracted file contents, extracting bibliographic field information from the file contents and correspondingly storing the bibliographic field information to bibliographic field storage positions in a database, wherein each bibliographic field content in the file contents corresponds to different XML nodes, and each XML node is in anchor point matching with the bibliographic field storage position of the database;
the keyword extraction module is used for automatically extracting keywords from the generated XML file and storing the extracted keywords to a keyword storage position of a database;
and the document association module is used for respectively converting the keywords extracted from the two PDF documents into two feature vectors, calculating the similarity of the two PDF documents based on the two feature vectors, and associating the two PDF documents of which the similarity calculation result is greater than a threshold value.
7. The enterprise research result management platform of claim 6, wherein the platform further comprises:
the external information capturing module is used for capturing webpage information from the Internet by adopting a web crawler technology, automatically extracting the text content of the webpage, cleaning and denoising the extracted webpage content and then importing the webpage content into the database.
8. The enterprise research result management platform of claim 6, wherein the platform further comprises:
the encryption module is used for encrypting the enterprise scientific research result files stored in the database.
9. An apparatus comprising a processor and a memory, the memory having stored therein a computer program, the processor being configured to perform the steps of the method of any one of claims 1 to 5 by invoking the computer program stored in the memory.
10. A computer-readable storage medium for storing a computer program for managing an enterprise's scientific results, wherein the computer program, when executed on a computer, performs the steps of the method according to any one of claims 1 to 5.
CN202111010269.4A 2021-08-31 2021-08-31 Enterprise scientific research result management method, management platform, equipment and storage medium Active CN113448918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111010269.4A CN113448918B (en) 2021-08-31 2021-08-31 Enterprise scientific research result management method, management platform, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111010269.4A CN113448918B (en) 2021-08-31 2021-08-31 Enterprise scientific research result management method, management platform, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113448918A CN113448918A (en) 2021-09-28
CN113448918B true CN113448918B (en) 2021-11-12

Family

ID=77819277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111010269.4A Active CN113448918B (en) 2021-08-31 2021-08-31 Enterprise scientific research result management method, management platform, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113448918B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987325B (en) * 2021-11-02 2024-12-03 北京师范大学珠海分校 Semi-automatic search method for global corporate scientific research funding information based on information aggregation
CN114493939A (en) * 2022-02-14 2022-05-13 山东大学 A system for promoting scientific research achievements based on deep learning
CN114780567A (en) * 2022-05-25 2022-07-22 江苏优集科技有限公司 A system and method for updating file layout based on distributed file system
CN114997830B (en) * 2022-06-13 2025-02-14 西安远诺技术转移有限公司 A technical manager work system
CN115809649A (en) * 2022-11-23 2023-03-17 明度智云(浙江)科技有限公司 An eCTD conversion method, system and storage medium for NeeS electronic documents

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1587010A2 (en) * 2004-04-15 2005-10-19 Microsoft Corporation Verifying relevance between keywords and web site contents
CN101004737A (en) * 2007-01-24 2007-07-25 贵阳易特软件有限公司 Individualized document processing system based on keywords
CN102024002A (en) * 2009-09-10 2011-04-20 上海中信信息发展股份有限公司 Safe storage method and system of filing of electronic documents
CN103218351A (en) * 2013-03-15 2013-07-24 杭州中元数据科技有限公司 Modern local literature electronic book manufacture method
JP2016126356A (en) * 2014-12-26 2016-07-11 ブラザー工業株式会社 Image processing program, image processing method, and image processing apparatus
CN112597267A (en) * 2020-12-14 2021-04-02 北京理工大学 English thesis document multi-granularity content processing method based on pattern recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1770552A3 (en) * 2005-07-13 2007-05-09 Rivergy, Inc. System for building a website for easier search engine retrieval.

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
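Implementation of a digital teaching resource clustering and reconstruction system; Zhang Chengyu; New Technology of Library and Information Service (《现代图书情报技术》); 2007-08-25 (No. 08); full text *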

Also Published As

Publication number Publication date
CN113448918A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN113448918B (en) Enterprise scientific research result management method, management platform, equipment and storage medium
CN110163478B (en) Risk examination method and device for contract clauses
US12277385B2 (en) Text keyword extraction method, electronic device, and computer readable storage medium
CN108132929A (en) A kind of similarity calculation method of magnanimity non-structured text
Al-Ash et al. Fake news identification characteristics using named entity recognition and phrase detection
WO2024109619A1 (en) Sensitive data identification method and apparatus, device, and computer storage medium
CN110532352B (en) Text duplication checking method and device, computer readable storage medium and electronic equipment
Nguyen et al. Exploiting syntactic and semantic information for relation extraction from wikipedia
WO2023060634A1 (en) Case concatenation method and apparatus based on cross-chapter event extraction, and related component
Munyer et al. DeepTextMark: a deep learning-driven text watermarking approach for identifying large language model generated text
Kužina et al. Methods for automatic sensitive data detection in large datasets: a review
Lin et al. Sensitive information detection based on convolution neural network and bi-directional lstm
Oswal Identifying and categorizing offensive language in social media
CN119377415B (en) Chinese bad language theory detection method and system
CN119202213B (en) Multi-level domain knowledge question-answering method and device based on large model
CN114118089A (en) A method and system for the construction of enterprise judicial litigation relationship based on judgment documents
CN117313721A (en) Document management method and device based on natural language processing technology
US20230127562A1 (en) Composite extraction systems and methods for artificial intelligence platform
Baryshev et al. Information System for the Fact-checker Support
Van Der Elst Extracting ESG data from business documents
Liu IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images
CN120337937B (en) Academic opinion extraction method and system applied to academic literature
CN117573956B (en) Metadata management method, device, equipment and storage medium
KR102834704B1 (en) Method and system for searching similar precedents using a deep learning language model with dual encoders
CN120011480B (en) Text tracing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant