
CN108804408A - Information extraction system based on domain-specialist knowledge system and information extraction method - Google Patents


Info

Publication number
CN108804408A
Authority
CN
China
Prior art keywords
information extraction
domain
knowledge
rule
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710289555.6A
Other languages
Chinese (zh)
Inventor
司华建
贾真
耿伟
金重九
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Fu Chi Information Technology Co Ltd
Original Assignee
Anhui Fu Chi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Fu Chi Information Technology Co Ltd
Priority to CN201710289555.6A
Publication of CN108804408A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information extraction system based on a domain expert knowledge system, together with its information extraction method. The system comprises a resource management module, a preprocessing module, a core processing module, and an output module. The method is as follows: experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build a domain knowledge base, and also identify and define knowledge points through the resource management module; maintenance personnel write extraction rules through the rule base unit, according to the extraction requirements, to form an information extraction rule base; the preprocessing module normalizes and segments the content of judgment documents; the core processing module applies an information extraction algorithm, driven by the domain knowledge base and the manually written extraction rule base, to extract information points; and the output module outputs the extraction results. The invention offers advantages such as high generality and low maintenance cost.

Description

Information extraction system based on domain-specialist knowledge system and information extraction method
Technical field
The present invention relates to the field of information extraction, and specifically to an information extraction system based on a domain expert knowledge system and its information extraction method.
Background art
A judgment document ("court verdict" as a legal term) is the document a court writes to record its judgment. It is a commonly used form of practical writing in the legal profession and includes civil judgments, criminal judgments, administrative judgments, and incidental civil judgments.
Under a new rule issued by the Supreme People's Court, legally effective court judgments have been published in full on the internet since January 1, 2014. Except for four categories of judgments (those involving state secrets, personal privacy, or juvenile crime, and those otherwise unsuitable for public exposure), the public can consult them at any time.
At present, existing document extraction techniques are mainly rule-based. The information points they extract are scattered and unsystematic, so they cannot meet the changing demands of extraction tasks. In addition, existing text extraction techniques are costly to maintain and are therefore unsuitable for wide adoption.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the lack of generality and the high maintenance threshold of the prior art by providing an information extraction system based on a domain expert knowledge system and its information extraction method.
The technical solution provided by the present invention is as follows. The invention discloses an information extraction system based on a domain expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module. The resource management module manages the domain knowledge base and the information extraction rule base. The preprocessing module normalizes and segments the content of judgment documents. The core processing module extracts information points by applying an information extraction algorithm to the domain knowledge base and the manually written rule resources. The output module outputs the extraction results.
Preferably, the resource management module comprises an expert knowledge base unit and a rule base unit. The expert knowledge base unit is used by experts to organize judicial-domain knowledge into a domain knowledge base, and by judicial-domain experts to identify and define knowledge points. The rule base unit is used by maintenance personnel to write extraction rules, according to the extraction requirements, to form the information extraction rule base.
Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build the domain knowledge base, and also identify and define knowledge points through the resource management module;
(2) Maintenance personnel write extraction rules through the rule base unit, according to the extraction requirements, to form the information extraction rule base;
(3) The preprocessing module normalizes and segments the content of the judgment document;
(4) The core processing module extracts information points by applying the information extraction algorithm to the domain knowledge base and the manually written information extraction rule base;
(5) The output module outputs the extraction results.
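The flow of steps (3) to (5) can be sketched as a small pipeline. This is only an illustrative sketch, not the patented implementation: the class and function names (`ResourceManager`, `preprocess`, `extract`, `output`), the regular-expression rule format, and the English sample text are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Holds the two resources named in the text (the contents are hypothetical)."""
    knowledge_base: dict = field(default_factory=dict)    # knowledge point -> definition
    extraction_rules: dict = field(default_factory=dict)  # information point -> regex


def preprocess(document):
    """Step (3): normalize whitespace and split into non-empty paragraphs."""
    paragraphs = [" ".join(p.split()) for p in document.split("\n")]
    return [p for p in paragraphs if p]


def extract(paragraphs, resources):
    """Step (4): apply every hand-written rule to every paragraph."""
    results = []
    for paragraph in paragraphs:
        for name, pattern in resources.extraction_rules.items():
            match = re.search(pattern, paragraph)
            if match:
                results.append((name, match.group(1)))
    return results


def output(results):
    """Step (5): return the extracted information points as a dictionary."""
    return dict(results)


# Hypothetical resources and document, for illustration only.
resources = ResourceManager(
    knowledge_base={"defendant": "the party against whom the action is brought"},
    extraction_rules={
        "case_number": r"Case No\. (\(\d{4}\) \d+)",
        "defendant": r"defendant (\w+ \w+)",
    },
)
document = "Case No.  (2016) 123\n\nThe defendant Zhang San was sentenced."
result = output(extract(preprocess(document), resources))
```

Keeping the rules in a data structure separate from the processing code mirrors the separation the text describes between the resource management module and the core processing module: rules can change without touching the extraction loop.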
Preferably, step (3) proceeds as follows: the content stated by each paragraph is determined, the paragraphs are then classified using the naive Bayes method or the rule classification method, and the results are ranked, which achieves automatic segmentation; finally, the classification result is output.
Preferably, the rule classification method classifies paragraphs according to rules written by maintenance personnel.
Preferably, the ranking algorithm is $f_{\text{score}} = w_1 \cdot f_{\text{Bayesian}} + w_2 \cdot f_{\text{Rule}}$,
where $f_{\text{score}}$ is the total score for assigning label A to the paragraph, $f_{\text{Bayesian}}$ is the Bayes classification score for assigning label A to the paragraph, $f_{\text{Rule}}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
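As a worked example of the weighted score, the sketch below fuses a Bayes score and a rule-matching score per candidate label and picks the highest-scoring label. The label names, the score values, and the weights 0.5 and 0.5 are made up for illustration; the source only states that the weights are obtained by training.

```python
def fuse_scores(bayes_scores, rule_scores, w1, w2):
    """Return w1 * f_Bayesian + w2 * f_Rule for every candidate label."""
    labels = set(bayes_scores) | set(rule_scores)
    return {
        label: w1 * bayes_scores.get(label, 0.0) + w2 * rule_scores.get(label, 0.0)
        for label in labels
    }


def best_label(scores):
    """Rank the candidate labels by total score and return the top one."""
    return max(scores, key=scores.get)


# Hypothetical per-label scores for one paragraph.
bayes = {"facts": 0.5, "verdict": 0.25}
rules = {"facts": 0.0, "verdict": 1.0}

scores = fuse_scores(bayes, rules, w1=0.5, w2=0.5)
# facts:   0.5 * 0.5  + 0.5 * 0.0 = 0.25
# verdict: 0.5 * 0.25 + 0.5 * 1.0 = 0.625
label = best_label(scores)  # "verdict"
```

In this example the rule match outweighs the slightly higher Bayes score for "facts", which shows how the two signals can correct each other.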
Preferably, step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different types must be identified with different methods.
Compared with the prior art, the present invention has the following advantages:
The invention centers on an organized system of domain business knowledge and adopts an architecture built from a preprocessing module and a core processing module. The preprocessing module first normalizes and segments the content of the judgment document. Although there is a specification for writing judgment documents, it only states which information a document should contain and how it is roughly divided into blocks, so each judge has a certain degree of freedom when writing. The purpose of segmentation is to determine what each paragraph states; labeling each paragraph is the prerequisite for the subsequent extraction of information points. The core processing module then extracts information points by applying the information extraction algorithm to the domain knowledge base and the manually written information extraction rule base. This greatly improves the generality of the extraction system and lowers its maintenance threshold, allowing it to cope with changing extraction demands.
Description of the drawings
Fig. 1 is a system block diagram of the information extraction system based on a domain expert knowledge system according to the present invention;
Fig. 2 is a schematic diagram of Embodiment 1 of the present invention;
Fig. 3 is a schematic structural diagram of step (3) for the information extraction system based on a domain expert knowledge system according to the present invention.
Detailed description of the embodiments
Referring to Figs. 1 to 3, the invention discloses an information extraction system based on a domain expert knowledge system, comprising a resource management module 1, a preprocessing module 2, a core processing module 3, and an output module 4. The resource management module 1 manages the domain knowledge base and the information extraction rule base. The preprocessing module 2 normalizes and segments the content of judgment documents. The core processing module 3 extracts information points by applying an information extraction algorithm to the domain knowledge base and the manually written rule resources. The output module 4 outputs the extraction results.
Preferably, the resource management module 1 comprises an expert knowledge base unit 11 and a rule base unit 12. The expert knowledge base unit 11 is used by experts to organize judicial-domain knowledge into a domain knowledge base, and by judicial-domain experts to identify and define knowledge points. The rule base unit 12 is used by maintenance personnel to write extraction rules, according to the extraction requirements, to form the information extraction rule base.
Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit 11 to build the domain knowledge base, and also identify and define knowledge points through the resource management module;
(2) Maintenance personnel write extraction rules through the rule base unit 12, according to the extraction requirements, to form the information extraction rule base;
(3) The preprocessing module 2 normalizes and segments the content of the judgment document;
(4) The core processing module 3 extracts information points by applying the information extraction algorithm to the domain knowledge base and the manually written information extraction rule base;
(5) The output module 4 outputs the extraction results.
Preferably, step (3) proceeds as follows: the content stated by each paragraph is determined, the paragraphs are then classified using the naive Bayes method or the rule classification method, and the results are ranked, which achieves automatic segmentation; finally, the classification result is output.
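The Bayes method named above (naive Bayes) can be sketched as a small multinomial classifier with add-one smoothing. This is a minimal sketch under stated assumptions: the class name, the whitespace tokenizer, the paragraph labels, and the English training sentences are all hypothetical, and real Chinese judgment text would first need word segmentation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesParagraphClassifier:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of paragraphs
        self.vocabulary = set()

    def fit(self, paragraphs, labels):
        for text, label in zip(paragraphs, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocabulary.update(tokens)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(token | label) for each label."""
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical labeled paragraphs, for illustration only.
classifier = NaiveBayesParagraphClassifier()
classifier.fit(
    [
        "the plaintiff claims the defendant owes money",
        "the court orders the defendant to pay",
        "the plaintiff claims damages",
        "the court orders as follows",
    ],
    ["claims", "verdict", "claims", "verdict"],
)
```

Calling `classifier.predict("the court orders payment")` selects the `verdict` label here, because "court" and "orders" occur only in the verdict training paragraphs.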
Preferably, the rule classification method classifies paragraphs according to rules written by maintenance personnel.
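A rule classifier of this kind could look like the sketch below, where each maintainer-written rule is a label, a regular expression, and a weight. The rule format, the labels, the patterns, and the fallback label `other` are assumptions for illustration; the source does not specify how the rules are written.

```python
import re

# Hypothetical maintainer-written rules: (label, pattern, weight).
RULES = [
    ("parties", r"^(plaintiff|defendant)\b", 1.0),
    ("findings", r"\bthe court finds\b", 1.0),
    ("verdict", r"\bjudgment is (hereby )?(rendered|entered) as follows\b", 1.0),
]


def rule_scores(paragraph):
    """Sum the weights of all matching rules, per label."""
    scores = {}
    for label, pattern, weight in RULES:
        if re.search(pattern, paragraph, flags=re.IGNORECASE):
            scores[label] = scores.get(label, 0.0) + weight
    return scores


def classify_by_rules(paragraph, default="other"):
    """Return the best-scoring label, or the default when no rule matches."""
    scores = rule_scores(paragraph)
    return max(scores, key=scores.get) if scores else default
```

For example, `classify_by_rules("Judgment is hereby rendered as follows:")` returns `"verdict"`, while a paragraph that matches no rule falls back to `"other"`.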
Preferably, the ranking algorithm is $f_{\text{score}} = w_1 \cdot f_{\text{Bayesian}} + w_2 \cdot f_{\text{Rule}}$,
where $f_{\text{score}}$ is the total score for assigning label A to the paragraph, $f_{\text{Bayesian}}$ is the Bayes classification score for assigning label A to the paragraph, $f_{\text{Rule}}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
Preferably, step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different types must be identified with different methods.
Embodiment 1
An information extraction method for the above information extraction system based on a domain expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit 11 to build the domain knowledge base, and also identify and define knowledge points through the resource management module;
(2) Maintenance personnel write extraction rules through the rule base unit 12, according to the extraction requirements, to form the information extraction rule base;
(3) The preprocessing module 2 normalizes and segments the content of the judgment document. Specifically, the content stated by each paragraph is determined, the paragraphs are then classified using the naive Bayes method, and the results are ranked, which achieves automatic segmentation; finally, the classification result is output. The ranking algorithm is $f_{\text{score}} = w_1 \cdot f_{\text{Bayesian}} + w_2 \cdot f_{\text{Rule}}$,
where $f_{\text{score}}$ is the total score for assigning label A to the paragraph, $f_{\text{Bayesian}}$ is the Bayes classification score for assigning label A to the paragraph, $f_{\text{Rule}}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training;
(4) The core processing module 3 extracts information points by applying the information extraction algorithm to the domain knowledge base and the manually written information extraction rule base. Based on the automatic segmentation result of step (3), different information points are extracted from each paragraph. Because judgment documents contain many information points of many types, different types must be identified with different methods. Take person names, place names, times, and organization names as examples: in the field of natural language understanding, recognizing these types is called named entity recognition (NER), for which this system combines statistical and rule-based methods, supplemented by part-of-speech information for a combined judgment. Judgment documents also contain a variety of complex relationship descriptions; for these the system mainly uses rules, defining extraction templates for multiple relationship types, supplemented by simple inference;
(5) The output module 4 outputs the extraction results.
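The rule-based side of step (4) can be sketched with regular expressions: one set of patterns for simple entities and one template for a relationship. This covers only the rule component; the statistical and part-of-speech components the embodiment mentions are not shown. All patterns, the loan relationship template, and the English sample sentence are hypothetical.

```python
import re

# Hypothetical entity patterns (rule component of entity recognition only).
ENTITY_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "court": re.compile(r"\b(?:[A-Z][a-z]+ )+People's Court\b"),
}

# Hypothetical relationship template: "<lender> lent <amount> yuan to <borrower>".
LOAN_TEMPLATE = re.compile(
    r"(?P<lender>[A-Z][a-z]+ [A-Z][a-z]+) lent (?P<amount>[\d,]+) yuan "
    r"to (?P<borrower>[A-Z][a-z]+ [A-Z][a-z]+)"
)


def extract_entities(text):
    """Return every match of each entity pattern, keyed by entity type."""
    return {name: pattern.findall(text) for name, pattern in ENTITY_PATTERNS.items()}


def extract_loans(text):
    """Return one dictionary per match of the loan relationship template."""
    return [match.groupdict() for match in LOAN_TEMPLATE.finditer(text)]


paragraph = (
    "On 2015-03-01, Zhang San lent 50,000 yuan to Li Si. "
    "The case was heard by the Hefei Intermediate People's Court on 2016-05-20."
)
entities = extract_entities(paragraph)
loans = extract_loans(paragraph)
```

Named groups in the template give each extracted relationship a fixed set of roles, so the output can be stored directly as structured information points.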
The above embodiment merely illustrates the principles and effects of the present invention and is not intended to limit it. Anyone skilled in the art may modify or change the above embodiment without departing from the spirit and scope of the invention. Therefore, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (7)

1. An information extraction system based on a domain expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module, characterized in that: the resource management module is used to manage the domain knowledge base and the information extraction rule base; the preprocessing module is used to normalize and segment the content of judgment documents; the core processing module is used to extract information points by applying an information extraction algorithm to the domain knowledge base and the manually written rule resources; and the output module is used to output the extraction results.
2. The information extraction system based on a domain expert knowledge system according to claim 1, characterized in that: the resource management module comprises an expert knowledge base unit and a rule base unit; the expert knowledge base unit is used by experts to organize judicial-domain knowledge into a domain knowledge base, and by judicial-domain experts to identify and define knowledge points; and the rule base unit is used by maintenance personnel to write extraction rules, according to the extraction requirements, to form the information extraction rule base.
3. An information extraction method for the information extraction system based on a domain expert knowledge system according to claim 1 or 2, characterized by the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build the domain knowledge base, and also identify and define knowledge points through the resource management module;
(2) Maintenance personnel write extraction rules through the rule base unit, according to the extraction requirements, to form the information extraction rule base;
(3) The preprocessing module normalizes and segments the content of the judgment document;
(4) The core processing module extracts information points by applying the information extraction algorithm to the domain knowledge base and the manually written information extraction rule base;
(5) The output module outputs the extraction results.
4. The information extraction method according to claim 3, characterized in that step (3) proceeds as follows: the content stated by each paragraph is determined, the paragraphs are then classified using the naive Bayes method or the rule classification method, and the results are ranked, which achieves automatic segmentation; finally, the classification result is output.
5. The information extraction method according to claim 4, characterized in that: the rule classification method classifies paragraphs according to rules written by maintenance personnel.
6. The information extraction method according to claim 4, characterized in that: the ranking algorithm is $f_{\text{score}} = w_1 \cdot f_{\text{Bayesian}} + w_2 \cdot f_{\text{Rule}}$,
where $f_{\text{score}}$ is the total score for assigning label A to the paragraph, $f_{\text{Bayesian}}$ is the Bayes classification score for assigning label A to the paragraph, $f_{\text{Rule}}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
7. The information extraction method according to claim 3, characterized in that: step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3); because judgment documents contain many information points of many types, different types must be identified with different methods.
CN201710289555.6A 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method Pending CN108804408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710289555.6A CN108804408A (en) 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710289555.6A CN108804408A (en) 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method

Publications (1)

Publication Number Publication Date
CN108804408A true CN108804408A (en) 2018-11-13

Family

ID=64069303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710289555.6A Pending CN108804408A (en) 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method

Country Status (1)

Country Link
CN (1) CN108804408A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753020A (en) * 2019-03-28 2020-10-09 阿里巴巴集团控股有限公司 A method and device for establishing a relationship network model
CN115374291A (en) * 2022-08-23 2022-11-22 浪潮软件科技有限公司 Method and system for constructing knowledge base based on business object
CN120219799A (en) * 2025-02-25 2025-06-27 南京迈拓医药科技有限公司 Identification and classification methods for lesions in medical images

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024329B1 (en) * 2006-06-01 2011-09-20 Monster Worldwide, Inc. Using inverted indexes for contextual personalized information retrieval
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN103049490A (en) * 2012-12-05 2013-04-17 北京海量融通软件技术有限公司 Attribute generation system and generation method among knowledge network nodes
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN103618652A (en) * 2013-12-17 2014-03-05 沈阳觉醒软件有限公司 Audit and depth analysis system and audit and depth analysis method of business data
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN105069560A (en) * 2015-07-30 2015-11-18 中国科学院软件研究所 Resume information extraction and characteristic identification analysis system and method based on knowledge base and rule base
CN105574084A (en) * 2015-12-10 2016-05-11 天津海量信息技术有限公司 Extraction method of case information in webpage

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024329B1 (en) * 2006-06-01 2011-09-20 Monster Worldwide, Inc. Using inverted indexes for contextual personalized information retrieval
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN103049490A (en) * 2012-12-05 2013-04-17 北京海量融通软件技术有限公司 Attribute generation system and generation method among knowledge network nodes
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN103618652A (en) * 2013-12-17 2014-03-05 沈阳觉醒软件有限公司 Audit and depth analysis system and audit and depth analysis method of business data
CN105069560A (en) * 2015-07-30 2015-11-18 中国科学院软件研究所 Resume information extraction and characteristic identification analysis system and method based on knowledge base and rule base
CN105574084A (en) * 2015-12-10 2016-05-11 天津海量信息技术有限公司 Extraction method of case information in webpage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邱莉榕 等编著: "《算法设计与优化》", 31 December 2016, 北京:中央民族大学出版社 *

Similar Documents

Publication Publication Date Title
CN110334212A (en) A kind of territoriality audit knowledge mapping construction method based on machine learning
CN105095190B (en) A kind of sentiment analysis method combined based on Chinese semantic structure and subdivision dictionary
CN110807328A (en) Named entity identification method and system oriented to multi-strategy fusion of legal documents
CN107273295B (en) Software problem report classification method based on text chaos
CN111259160B (en) Knowledge graph construction method, device, equipment and storage medium
CN104573094B (en) Network account identifies matching process
CN106649223A (en) Financial report automatic generation method based on natural language processing
CN114265931B (en) Consumer policy perception analysis method and system based on big data text mining
CN111047428B (en) Identification method of bank high-risk fraudulent customers based on a small number of fraud samples
CN106682236A (en) Machine learning based patent data processing method and processing system adopting same
CN106709804A (en) Interactive wealth planning consulting robot system
CN109241297A (en) A kind of classifying content polymerization, electronic equipment, storage medium and engine
CN109165337A (en) A kind of method and system of knowledge based map construction bidding field association analysis
Wróblewska et al. Robotic Process Automation of Unstructured Data with Machine Learning.
Ji et al. Confucian dynamism and Dunning's framework: Direct and moderation associations in internationalized Chinese private firms
CN108804408A (en) Information extraction system based on domain-specialist knowledge system and information extraction method
Ashayeri et al. Unraveling energy justice in NYC urban buildings through social media sentiment analysis and transformer deep learning
CN110413795A (en) A kind of professional knowledge map construction method of data-driven
CN107798137A (en) A kind of multi-source heterogeneous data fusion architecture system based on additive models
CN111104975A (en) Credit assessment model based on breadth learning
CN116910238A (en) A knowledge-aware fake news detection method based on Siamese network
CN108229565A (en) A kind of image understanding method based on cognition
CN104657422A (en) Classification decision tree-based intelligent content distribution classification method
CN113222471B (en) Asset risk control method and device based on new media data
Fariha et al. A new framework for mining frequent interaction patterns from meeting databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181113