
CN109815500A - Management method, device, computer equipment and the storage medium of unstructured official document - Google Patents

Management method, device, computer equipment and the storage medium of unstructured official document

Info

Publication number
CN109815500A
Authority
CN
China
Prior art keywords
unstructured
unstructured document
document
preset
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910074336.5A
Other languages
Chinese (zh)
Inventor
吴雄辉
王丽娟
秦锋剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Green Bay Network Technology Co Ltd
Original Assignee
Hangzhou Green Bay Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Green Bay Network Technology Co Ltd filed Critical Hangzhou Green Bay Network Technology Co Ltd
Priority to CN201910074336.5A priority Critical patent/CN109815500A/en
Publication of CN109815500A publication Critical patent/CN109815500A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application proposes a management method, a device, computer equipment and a storage medium for unstructured official documents. The method includes: obtaining an unstructured official document to be identified; identifying the unstructured official document to be identified according to a preset identification model to obtain the attribute information in the unstructured official document to be identified; and storing the unstructured official document to be identified according to the attribute information. The validity and accuracy of the management of unstructured official documents are thereby improved.

Description

Management method and device of unstructured official document, computer equipment and storage medium
Technical Field
The application relates to the technical field of electronic government affairs, in particular to a method and a device for managing unstructured documents, computer equipment and a storage medium.
Background
At present, government official document processing relies on two approaches: management means and technical schemes. The management means aims to turn every official document to be issued by every issuing department into a managed object, mainly by manually entering the document abstract, receiving department, related personnel, contact information and the like into a management system. This approach is inefficient: no dedicated staff are assigned to data entry and historical documents are left unattended, so the large number of documents exchanged between parallel and cross departments across the whole government system cannot be entered effectively. The technical scheme mainly enters all official documents and supports simple matching queries on some documents or their contents; it provides no effective identification or organized management, and cannot manage document relations and associations between cross departments and parallel departments. Therefore, neither of the two schemes can manage official documents effectively.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the application provides a method, a device and a storage medium for managing an unstructured document, which are used for solving the technical problem that the unstructured document cannot be effectively managed in the prior art.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a method for managing an unstructured document, including:
acquiring an unstructured document to be identified;
identifying the unstructured document to be identified according to a preset identification model, and acquiring attribute information in the unstructured document to be identified;
and storing the unstructured document to be identified according to the attribute information.
According to the management method of the unstructured document, the unstructured document to be identified is obtained; identifying the unstructured document to be identified according to a preset identification model, and acquiring attribute information in the unstructured document to be identified; and storing the unstructured documents to be identified according to the attribute information. Therefore, the effectiveness and the accuracy of the management of the unstructured official documents are improved.
In order to achieve the above object, a second aspect of the present application provides an apparatus for managing unstructured documents, including:
the acquisition module is used for acquiring the unstructured document to be identified;
the identification module is used for identifying the unstructured document to be identified according to a preset identification model and acquiring attribute information in the unstructured document to be identified;
and the storage module is used for storing the unstructured document to be identified according to the attribute information.
The management device of the unstructured official document of the embodiment of the application acquires the unstructured document to be identified, identifies it according to the preset identification model, acquires the attribute information in the unstructured document to be identified, and stores the unstructured document to be identified according to the attribute information. Therefore, the effectiveness and the accuracy of the management of the unstructured official documents are improved.
To achieve the above object, a third aspect of the present application provides a computer device, including: a processor and a memory; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the management method of the unstructured document according to the embodiment of the first aspect.
To achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for managing an unstructured document according to the first aspect.
To achieve the above object, a fifth aspect of the present application provides a computer program product, where instructions of the computer program product, when executed by a processor, implement the method for managing an unstructured document according to the first aspect.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating a method for managing unstructured documents according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for managing unstructured documents according to a second embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for managing unstructured documents according to a third embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for managing unstructured documents according to a fourth embodiment of the present application;
FIG. 5 is a schematic structural diagram of a management apparatus for unstructured documents according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a management apparatus for unstructured documents according to a second embodiment of the present application;
FIG. 7 is a schematic structural diagram of a management apparatus for unstructured documents according to a third embodiment of the present application;
FIG. 8 is a schematic structural diagram of a management apparatus for unstructured documents according to a fourth embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
A method, an apparatus, a computer device, and a storage medium for managing an unstructured document according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a method for managing an unstructured document according to an embodiment of the present application.
As shown in fig. 1, the method for managing an unstructured document may include the following steps:
Step 101, acquiring an unstructured document to be identified.
In practical applications, many government official documents are not stored according to any fixed scheme, that is, they are unstructured, and no dedicated staff enter them into a system, so the official documents cannot be managed effectively.
First, the unstructured documents to be identified are acquired. There may be one or more such documents that have not been stored according to a fixed scheme, and which documents are to be identified can be determined according to actual application needs.
Step 102, identifying an unstructured document to be identified according to a preset identification model, and acquiring attribute information in the unstructured document to be identified.
Step 103, storing the unstructured document to be identified according to the attribute information.
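The three steps above can be sketched as a small pipeline. This is only an illustrative sketch: the two regular expressions stand in for the preset recognition model, and the names `recognize`, `DocumentStore`, and the sample documents are assumptions made for the example, not part of the application.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the preset recognition model: two regular
# expressions instead of a trained model.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")
ORG = re.compile(r"\b[A-Z][a-z]+ (?:Bureau|Office|Department)\b")


def recognize(text):
    """Step 102: return the attribute information found in one document."""
    return {
        "organization": ORG.findall(text),
        "contact": PHONE.findall(text),
    }


class DocumentStore:
    """Step 103: keep documents indexed by (attribute name, attribute value)."""

    def __init__(self):
        self.documents = []
        self.index = defaultdict(list)

    def add(self, text):
        doc_id = len(self.documents)
        self.documents.append(text)
        for name, values in recognize(text).items():
            for value in values:
                self.index[(name, value)].append(doc_id)
        return doc_id


store = DocumentStore()
# Step 101: the documents to be identified (made-up sample text).
for raw in [
    "Notice from the Finance Bureau, contact 555-0101.",
    "Reply to the Water Office regarding the Finance Bureau notice.",
]:
    store.add(raw)

print(store.index[("organization", "Finance Bureau")])  # prints [0, 1]
```

Indexing by attribute name and value is one possible reading of "storing according to the attribute information"; a real system might instead write the attributes to database columns.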
Specifically, the recognition model is generated in advance. As shown in fig. 2, generating the model includes:
Step 201, determining an annotated corpus.
Step 202, performing word segmentation processing on the plurality of training unstructured documents to obtain a plurality of training word segments in each training unstructured document.
Step 203, processing the annotated corpus and the plurality of training word segments according to a preset algorithm to generate the preset recognition model.
Specifically, the annotated corpus can be determined in many ways. An existing annotated corpus, such as the People's Daily corpus, can be used directly; a number of unannotated official documents can be selected and annotated manually; or part of the corpus can be taken from existing annotated material and the rest annotated manually. The choice can be made according to practical application needs.
For example, a BiLSTM + CRF model can be used, that is, a bidirectional LSTM (Long Short-Term Memory network) followed by a CRF (Conditional Random Field) layer. The bidirectional LSTM captures context information on both sides of each word, so deep learning can be performed more effectively, the manual effort of later-stage annotation is reduced, and the generation efficiency of the recognition model is further improved.
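To illustrate what the CRF layer contributes on top of the per-word scores, the following sketch decodes the best tag sequence with the Viterbi algorithm, the usual decoder for a linear-chain CRF. The application does not state how decoding is done, so this is an assumption; the tag set, the emission scores (which a bidirectional LSTM would produce) and the transition scores are all made up for the example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one dict per token mapping tag -> score (in a BiLSTM + CRF
               model these would come from the bidirectional LSTM).
    transitions: dict mapping (previous_tag, tag) -> score (the CRF layer's
               parameters); pairs that are missing score 0.
    """
    # best[t] is the best score of any path that ends in tag t.
    best = {t: emissions[0][t] for t in tags}
    backpointers = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, tag), 0.0))
            step[tag] = best[prev] + transitions.get((prev, tag), 0.0) + scores[tag]
            pointers[tag] = prev
        best = step
        backpointers.append(pointers)
    # Walk the back-pointers starting from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["B-ORG", "I-ORG", "O"]
# Hypothetical transition scores: "I-ORG" should not follow "O".
TRANSITIONS = {("O", "I-ORG"): -10.0, ("B-ORG", "I-ORG"): 2.0}
EMISSIONS = [
    {"B-ORG": 0.1, "I-ORG": 0.0, "O": 2.0},  # "the"
    {"B-ORG": 1.0, "I-ORG": 1.2, "O": 0.5},  # "Finance"
    {"B-ORG": 0.2, "I-ORG": 1.5, "O": 1.0},  # "Bureau"
]
print(viterbi(EMISSIONS, TRANSITIONS, TAGS))  # prints ['O', 'B-ORG', 'I-ORG']
```

Picking the best tag for each word independently would label "Finance" as I-ORG directly after O; the transition scores steer the decoder to the consistent sequence O, B-ORG, I-ORG.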
It should be noted that the annotated corpus stores language material that has actually appeared in real use of the language. It is a basic resource that carries linguistic knowledge with the computer as its carrier, and raw language material must be processed (for example, analyzed and annotated) before it becomes a useful resource.
The annotation may be part-of-speech tagging, that is, the process of determining the grammatical category of each word in a given sentence and labeling the word with that part of speech; it may use, for example, a position attribute vector, a part-of-speech tag sequence vector, or a clustering or classification algorithm.
It can be understood that, before generating the recognition model, a plurality of training unstructured documents need to be determined, and word segmentation processing is performed on each training unstructured document to obtain a plurality of training word segmentations in each training unstructured document.
As an example, a training unstructured official document A is obtained and its content is segmented into words. The content can be segmented in a preset word segmentation mode, for example the NlpAnalysis mode (word segmentation with new-word discovery) of Ansj Chinese word segmentation (a Java-based Chinese word segmentation tool). More specifically, by importing the Ansj jar (a software package file format) package and running the NlpAnalysis segmentation method, unknown words can be identified, person names, organization names and numbers are recognized well, and user-defined dictionaries are supported.
Chinese word segmentation refers to the process of cutting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a certain standard.
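As a rough illustration of cutting a character sequence into words, the sketch below uses forward maximum matching against a small dictionary. This is a deliberately simple substitute and is not how Ansj's NlpAnalysis works (it has no new-word discovery); the function name, the dictionary and the sample sentence are assumptions for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word, falling back to a single character if none matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical user-defined dictionary.
DICTIONARY = {"杭州", "市政府", "政府", "发布", "通知"}
print(forward_max_match("杭州市政府发布通知", DICTIONARY))
# prints ['杭州', '市政府', '发布', '通知']
```

Characters that match no dictionary entry come out as single-character words, which is where a tool with unknown-word recognition would do better.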
Finally, the annotated corpus and the plurality of training word segments are processed by a preset algorithm to generate the preset recognition model. The preset algorithm may be selected as needed, for example a deep neural network model or a pre-programmed algorithm.
Specifically, the generated recognition model stores the attribute information corresponding to different corpora or different word segments, such as position information, name information, contact information, and the like. Therefore, by recognizing the unstructured document to be recognized with the preset recognition model, the attribute information in that document can be acquired, and the document can be stored according to the attribute information.
The attribute information may be determined according to actual application requirements, such as extracting names, contact addresses, document abstracts, organizational relationships, and the like from the unstructured documents.
For example, the organization can be analyzed by extracting the issuing/receiving institution; a personnel lexicon can be extracted to classify personnel attributes; and the document abstract can be extracted to group similar documents, and so on.
To further guarantee the quality of the stored official documents, documents that violate regulations can be placed in a separate category through official document violation filtering, to facilitate subsequent processing.
It should be noted that, to ensure the validity of the recognition model, the model needs to be tested after it is generated. Specifically, as shown in fig. 3, the method includes:
Step 301, acquiring an unstructured document to be tested.
Step 302, performing word segmentation processing on the unstructured document to be tested to obtain a plurality of test words in the unstructured document to be tested.
Step 303, identifying the plurality of test word segments according to the preset recognition model to obtain a test value.
Step 304, judging the validity of the preset recognition model according to the test value and a preset threshold.
Specifically, after the recognition model is generated, the unstructured document to be tested is determined and segmented into words; for the specific process, see the description of step 202. The plurality of test word segments are then recognized according to the preset recognition model, which yields a number of recognized items of attribute information. Whether each recognized item is correct is checked in order to determine the test value, and finally the test value is compared with the preset threshold to judge the validity of the preset recognition model.
As an example, the test value includes the accuracy and the recall rate. The ratio of the accuracy to the recall rate is obtained, and if the ratio is greater than or equal to the preset threshold, the preset recognition model is determined to be valid.
For example, the recognition model is LDA (Latent Dirichlet Allocation, a document topic generation model), also called a three-layer Bayesian probability model, which comprises three layers: words, topics, and documents. In this recognition model, each word in a document is considered to be obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". The test values are defined as follows: accuracy (precision) = total number of correctly recognized individuals / total number of recognized individuals; recall = total number of correctly recognized individuals / total number of individuals present in the test set; F value = precision × recall × 2 / (precision + recall).
Therefore, one or more of the accuracy, the recall rate and the F value, or a ratio among them, can be selected as the test value according to actual application requirements, and a corresponding preset threshold is selected for comparison to determine the validity of the recognition model.
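The test values defined above can be computed as in the following sketch, which compares the attribute items recognized by the model with a manually checked reference set. The triples, the function names `evaluate` and `is_valid`, and the threshold of 0.5 are assumptions chosen for illustration.

```python
def evaluate(predicted, gold):
    """Compare two sets of (document id, attribute, value) triples."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


def is_valid(test_value, threshold):
    """Step 304: the model is valid if the test value reaches the threshold."""
    return test_value >= threshold


# Made-up reference annotations and model output for two test documents.
gold = {
    (1, "organization", "Finance Bureau"),
    (1, "contact", "555-0101"),
    (2, "organization", "Water Office"),
    (2, "contact", "555-0102"),
}
predicted = {
    (1, "organization", "Finance Bureau"),
    (1, "contact", "555-0101"),
    (2, "organization", "Land Office"),  # wrong value, counts as an error
}

precision, recall, f_value = evaluate(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_value:.3f}")
print(is_valid(f_value, threshold=0.5))
```

Here two of the three recognized items are correct and two of the four reference items are found, so the precision is 2/3, the recall is 1/2, and the F value is 4/7.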
Based on the description of the above embodiment, after step 103, as shown in fig. 4, the method further includes:
Step 401, obtaining extraction keywords.
Step 402, extracting the target unstructured official document according to the extraction keywords.
That is, the target unstructured document can be extracted from a plurality of unstructured documents stored according to the attribute information by determining extraction keywords such as an organization, a contact address, and the like, thereby improving the accuracy of the extraction of the unstructured documents.
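A minimal sketch of steps 401 and 402, assuming each stored document carries its recognized attributes as a flat record. The record layout, the `extract` function and the sample values are hypothetical.

```python
# Hypothetical stored documents with the attribute information kept at
# storage time (step 103).
STORED = [
    {"id": 1, "organization": "Finance Bureau", "contact": "555-0101"},
    {"id": 2, "organization": "Water Office", "contact": "555-0102"},
    {"id": 3, "organization": "Finance Bureau", "contact": "555-0103"},
]


def extract(documents, **keywords):
    """Return the documents whose stored attributes match every keyword."""
    return [
        doc for doc in documents
        if all(doc.get(name) == value for name, value in keywords.items())
    ]


# Step 401: the extraction keyword; step 402: the matching documents.
targets = extract(STORED, organization="Finance Bureau")
print([doc["id"] for doc in targets])  # prints [1, 3]
```

Combining several keywords, for example an organization and a contact, narrows the result to documents that match all of them.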
The method addresses the most difficult parts of the vertical and modal management of official documents, namely document identification and object extraction, so that subsequent hierarchical management of official documents, and effective management of documents during receiving and issuing, can be realized.
According to the management method of the unstructured document, the unstructured document to be recognized is obtained, the unstructured document to be recognized is recognized according to the preset recognition model, the attribute information in the unstructured document to be recognized is obtained, and the unstructured document to be recognized is stored according to the attribute information. Therefore, the effectiveness and accuracy of management of the unstructured official documents can be achieved.
In order to implement the above embodiments, the present application further provides a management device for unstructured documents.
Fig. 5 is a schematic structural diagram of a management apparatus for unstructured documents according to an embodiment of the present application.
As shown in fig. 5, the management apparatus 50 for unstructured documents includes: an acquisition module 510, a recognition module 520, and a storage module 530, wherein:
the acquisition module 510 is configured to acquire an unstructured document to be identified.
The recognition module 520 is configured to identify the unstructured document to be identified according to a preset recognition model, and acquire attribute information in the unstructured document to be identified.
The storage module 530 is configured to store the unstructured document to be identified according to the attribute information.
Further, in a possible implementation manner of the embodiment of the present application, as shown in fig. 6, on the basis of the embodiment shown in fig. 5, the apparatus further includes: a determination module 540, a word segmentation module 550, and a generation module 560.
The determination module 540 is configured to determine an annotated corpus.
The word segmentation module 550 is configured to perform word segmentation processing on the plurality of training unstructured documents to obtain a plurality of training word segments in each training unstructured document.
The generation module 560 is configured to process the annotated corpus and the plurality of training word segments according to a preset algorithm, and generate a preset recognition model.
In a possible implementation manner of the embodiment of the present application, as shown in fig. 7, on the basis of the embodiment shown in fig. 6, the apparatus further includes: a judging module 570, wherein:
the acquisition module 510 is further configured to obtain an unstructured document to be tested.
The word segmentation module 550 is further configured to perform word segmentation on the unstructured document to be tested, and obtain a plurality of test word segments in the unstructured document to be tested.
The recognition module 520 is further configured to recognize the plurality of test word segments according to the preset recognition model, and obtain a test value.
The judging module 570 is configured to judge the validity of the preset recognition model according to the test value and a preset threshold.
In one possible implementation manner of the embodiment of the present application, the test value includes the accuracy and the recall rate; the judging module 570 is specifically configured to obtain the ratio of the accuracy to the recall rate, and determine that the preset recognition model is valid if the ratio is greater than or equal to the preset threshold.
In a possible implementation manner of the embodiment of the present application, as shown in fig. 8, on the basis of the embodiment shown in fig. 5, the apparatus further includes: an extraction module 580.
The acquisition module 510 is further configured to obtain extraction keywords.
The extraction module 580 is configured to extract the target unstructured official document according to the extraction keywords.
It should be noted that the foregoing explanation of the embodiment of the method for managing an unstructured document is also applicable to the management apparatus of an unstructured document of this embodiment, and the implementation principle is similar, and is not repeated here.
The management device of the unstructured document of the embodiment of the application identifies the unstructured document to be identified according to the preset identification model by acquiring the unstructured document to be identified, acquires the attribute information in the unstructured document to be identified, and stores the unstructured document to be identified according to the attribute information. Therefore, the effectiveness and accuracy of management of the unstructured official documents can be achieved.
In order to implement the foregoing embodiments, the present application also provides a computer device, including: a processor and a memory. Wherein, the processor runs the program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the management method of the unstructured document as described in the foregoing embodiments.
FIG. 9 is a block diagram of a computer device provided in an embodiment of the present application, illustrating an exemplary computer device 90 suitable for use in implementing embodiments of the present application. The computer device 90 shown in fig. 9 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in fig. 9, the computer device 90 is in the form of a general purpose computer device. The components of computer device 90 may include, but are not limited to: one or more processors or processing units 906, a system memory 910, and a bus 908 that couples the various system components (including the system memory 910 and the processing unit 906).
Bus 908 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.
Computer device 90 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 90 and includes both volatile and nonvolatile media, removable and non-removable media.
The system Memory 910 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 911 and/or cache Memory 912. The computer device 90 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 913 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, and commonly referred to as a "hard disk drive"). Although not shown in FIG. 9, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 908 by one or more data media interfaces. System memory 910 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
Program/utility 914 having a set (at least one) of program modules 9140 may be stored, for example, in system memory 910, such program modules 9140 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination of these examples may comprise an implementation of a network environment. Program modules 9140 generally perform the functions and/or methods of embodiments described herein.
The computer device 90 may also communicate with one or more external devices 10 (e.g., keyboard, pointing device, display 100, etc.), with one or more devices that enable a user to interact with the computer device 90, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 90 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 902. Moreover, computer device 90 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 900. As shown in FIG. 9, network adapter 900 communicates with the other modules of computer device 90 via bus 908. It should be appreciated that although not shown in FIG. 9, other hardware and/or software modules may be used in conjunction with computer device 90, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 906 executes various functional applications and data processing by executing programs stored in the system memory 910, for example, implementing the management method of the unstructured documents mentioned in the foregoing embodiments.
In order to implement the foregoing embodiments, the present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the management method of the unstructured documents as described in the foregoing embodiments.
In order to implement the foregoing embodiments, the present application also proposes a computer program product, wherein when the instructions in the computer program product are executed by a processor, the management method of the unstructured document as described in the foregoing embodiments is implemented.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A management method for unstructured documents, characterized by comprising the following steps:
acquiring an unstructured document to be identified;
identifying the unstructured document to be identified according to a preset identification model, and acquiring attribute information in the unstructured document to be identified;
and storing the unstructured document to be identified according to the attribute information.
2. The method of claim 1, before the identifying the unstructured document to be identified according to the preset identification model and acquiring the attribute information in the unstructured document to be identified, further comprising:
determining an annotation corpus;
performing word segmentation processing on a plurality of training unstructured documents to obtain a plurality of training words in each training unstructured document;
and processing the annotation corpus and the training words according to a preset algorithm to generate the preset identification model.
3. The method of claim 2, after generating the preset identification model, further comprising:
acquiring an unstructured document to be tested;
performing word segmentation processing on the unstructured document to be tested to obtain a plurality of test words in the unstructured document to be tested;
identifying the test words according to the preset identification model to obtain a test value;
and judging the effectiveness of the preset identification model according to the test value and a preset threshold value.
4. The method of claim 3, wherein the test value comprises an accuracy rate and a recall rate;
the judging the effectiveness of the preset identification model according to the test value and the preset threshold value comprises the following steps:
acquiring the ratio of the accuracy rate to the recall rate;
and if the ratio is greater than or equal to the preset threshold value, determining that the preset identification model is valid.
5. The method of claim 1, after storing the unstructured document to be identified according to the attribute information, further comprising:
acquiring an extraction keyword;
and extracting a target unstructured document according to the extraction keyword.
6. An apparatus for managing unstructured documents, comprising:
the acquisition module is used for acquiring the unstructured document to be identified;
the identification module is used for identifying the unstructured document to be identified according to a preset identification model and acquiring attribute information in the unstructured document to be identified;
and the storage module is used for storing the unstructured document to be identified according to the attribute information.
7. The apparatus of claim 6, further comprising:
the determining module is used for determining an annotation corpus;
the word segmentation module is used for carrying out word segmentation processing on a plurality of training unstructured documents to obtain a plurality of training words in each training unstructured document;
and the generating module is used for processing the annotation corpus and the training words according to a preset algorithm to generate the preset identification model.
8. The apparatus of claim 7, wherein:
the acquisition module is further used for acquiring an unstructured document to be tested;
the word segmentation module is further used for carrying out word segmentation processing on the unstructured document to be tested to obtain a plurality of test words in the unstructured document to be tested;
the identification module is further used for identifying the test words according to the preset identification model to obtain a test value;
and the apparatus further comprises a judging module used for judging the effectiveness of the preset identification model according to the test value and a preset threshold value.
9. A computer device comprising a processor and a memory;
wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for implementing the management method of the unstructured document according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a method of managing an unstructured document according to any of claims 1 to 5.
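As a rough illustration only, the flow recited in claims 1, 4 and 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the date regular expression is a hypothetical stand-in for the preset identification model, and the names `identify_attributes`, `DocumentStore`, `issue_date` and `model_is_valid` are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the preset identification model: a date pattern.
# A trained model would be used in practice; this regex is only an assumption.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def identify_attributes(text):
    """Return attribute information found in the document (here: dates only)."""
    return {"issue_date": DATE_PATTERN.findall(text)}


class DocumentStore:
    """Stores documents indexed by their identified attribute values."""

    def __init__(self):
        self._index = defaultdict(list)

    def store(self, text):
        """Identify attributes, then file the document under each value."""
        attributes = identify_attributes(text)
        for values in attributes.values():
            for value in values:
                self._index[value].append(text)
        return attributes

    def extract(self, keyword):
        """Return the documents stored under the given extraction keyword."""
        return list(self._index.get(keyword, []))


def model_is_valid(accuracy, recall, threshold):
    """Claim 4 as worded: valid if accuracy / recall >= threshold."""
    if recall == 0:
        return False  # assumption: an undefined ratio is treated as invalid
    return accuracy / recall >= threshold


store = DocumentStore()
store.store("Notice issued 2019-01-25 on archive management.")
matches = store.extract("2019-01-25")
```

Indexing by attribute value is one possible reading of "storing according to the attribute information"; it makes the keyword extraction of claim 5 a direct lookup rather than a scan of every stored document.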
CN201910074336.5A 2019-01-25 2019-01-25 Management method, device, computer equipment and the storage medium of unstructured official document Pending CN109815500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074336.5A CN109815500A (en) 2019-01-25 2019-01-25 Management method, device, computer equipment and the storage medium of unstructured official document

Publications (1)

Publication Number Publication Date
CN109815500A true CN109815500A (en) 2019-05-28

Family

ID=66604984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074336.5A Pending CN109815500A (en) 2019-01-25 2019-01-25 Management method, device, computer equipment and the storage medium of unstructured official document

Country Status (1)

Country Link
CN (1) CN109815500A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8407217B1 (en) * 2010-01-29 2013-03-26 Guangsheng Zhang Automated topic discovery in documents
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN103310025A (en) * 2013-07-08 2013-09-18 北京邮电大学 Unstructured-data description method and device
CN107992597A (en) * 2017-12-13 2018-05-04 国网山东省电力公司电力科学研究院 A kind of text structure method towards electric network fault case
CN108228101A (en) * 2017-12-28 2018-06-29 北京盛和大地数据科技有限公司 A kind of method and system for managing data
CN108416032A (en) * 2018-03-12 2018-08-17 腾讯科技(深圳)有限公司 A kind of file classification method, device and storage medium
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108875067A (en) * 2018-06-29 2018-11-23 北京百度网讯科技有限公司 text data classification method, device, equipment and storage medium
CN108920656A (en) * 2018-07-03 2018-11-30 龙马智芯(珠海横琴)科技有限公司 Document properties description content extracting method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541373A (en) * 2019-09-20 2021-03-23 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method and related equipment
WO2021051957A1 (en) * 2019-09-20 2021-03-25 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method, and related device
CN112541373B (en) * 2019-09-20 2023-10-31 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method and related equipment
CN112948347A (en) * 2019-12-11 2021-06-11 北京懿医云科技有限公司 Text data structuring processing method, device, equipment and storage medium
CN112507968A (en) * 2020-12-24 2021-03-16 成都网安科技发展有限公司 Method and device for identifying official document text based on feature association
CN112507968B (en) * 2020-12-24 2024-03-05 成都网安科技发展有限公司 Document text recognition method and device based on feature association
CN113449525A (en) * 2021-07-08 2021-09-28 安徽商信政通信息技术股份有限公司 Intelligent file transfer method and system based on entity identification
CN113656353A (en) * 2021-08-03 2021-11-16 煤炭科学研究总院 BIM model processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 1901, building 1, No. 1782 Jiangling Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province
Applicant after: HANGZHOU LVWAN NETWORK TECHNOLOGY Co.,Ltd.
Address before: 2, No. 2630, building 2, superior Science Park, No. 310026 South Ring Road, Hangzhou, Binjiang District, Zhejiang, China
Applicant before: HANGZHOU LVWAN NETWORK TECHNOLOGY Co.,Ltd.
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20190528