CN101430714A - Style-based content structuring method and system - Google Patents

Style-based content structuring method and system

Info

Publication number
CN101430714A
Authority
CN
China
Prior art keywords
structured
content
style
document
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008102389945A
Other languages
Chinese (zh)
Other versions
CN101430714B (en)
Inventor
余忠华
闫国龙
赵朝阳
魏超鹏
苏勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN2008102389945A
Publication of CN101430714A
Application granted
Publication of CN101430714B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a style-based content structuring method and system in the technical field of information content processing. In the prior art, content structuring either requires manual entry or can structure only pure content data, and cannot carry over the layout and style information of the content. The method and system of the invention establish a content structuring scheme for the document to be structured, then establish the correspondence between styles and structured keywords, and finally parse the document and extract its content to form structured content, completing the structuring process. With this method and system, content structuring is not constrained by data fields and preserves the style attributes of the original document content: it makes use of the existing document information and layout features, and it structures content automatically.

Description

A style-based content structuring method and system
Technical field
The invention belongs to the technical field of information content processing, and specifically relates to a style-based content structuring method and system.
Background
With the spread of information technology, every industry has accumulated large volumes of information resources. Managing these internal and external resources scientifically and developing them sensibly has become key to sound decision-making and competitiveness. Before this information can be used, its data content usually has to be structured to suit the differing needs of different users.
Current content structuring systems are built mainly around storage, workflow, and metadata. Storage is mostly in relational databases, sometimes in general data files, or in both as required. There are two main approaches to content structuring. The first uses dedicated software to enter specific data fields; its drawbacks are that the fields are limited by the software and that entry is manual, which makes automation difficult. The second uses software such as an XML editor; its drawback is that it structures only pure content data and cannot carry over the layout and style of the original document. For prepress typesetting, and for users who want content to keep its styles, existing methods fall short because the structured content cannot also carry the style or layout information of the original. How to make structured content retain both the original content and its style or layout information has therefore become a concern for a growing number of users.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to provide a style-based content structuring method and system that both automates content structuring and lets the structured content retain the style and layout information of the original.
To achieve this object, the invention adopts the following technical solution: a style-based content structuring method comprising the following steps:
(1) establish a content structuring scheme: set structured keywords as required and determine the relationships between them;
(2) establish the correspondence between styles and structured keywords;
(3) parse the document to be structured and extract its content to form structured content.
Further, in step (1), the structured keywords are set according to the content structure of the document to be structured.
Further, in step (1), the structured keywords are set according to the styles of the document content.
Further, in step (1), the relationships between structured keywords are determined according to the styles of the document content. These relationships cover position, order, hierarchy, structure, and containment between keywords; that is, they mirror the actual relationships in the document between the pieces of content that the keywords represent.
Further, in step (2), one structured keyword may correspond to one or more styles, but one style can correspond to only one structured keyword.
Further, in step (3), structuring the document produces two files: a style mapping file, which records the correspondence between styles and structured keywords, and a structured content file, which records the correspondence between structured keywords and document content.
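The constraint in step (2), that a keyword may have several styles while a style belongs to exactly one keyword, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the function names `build_style_map` and `styles_by_keyword` and the sample style names are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_style_map(pairs):
    """Build a style -> keyword dict from (keyword, style) pairs.

    Hypothetical helper: rejects any style that is assigned to two
    different keywords, which is the one-style-one-keyword rule.
    """
    style_to_keyword = {}
    for keyword, style in pairs:
        existing = style_to_keyword.get(style)
        if existing is not None and existing != keyword:
            raise ValueError(
                f"style {style!r} already maps to keyword {existing!r}"
            )
        style_to_keyword[style] = keyword
    return style_to_keyword


def styles_by_keyword(style_to_keyword):
    """Invert the map: one keyword may own one or more styles."""
    grouped = defaultdict(list)
    for style, keyword in style_to_keyword.items():
        grouped[keyword].append(style)
    return dict(grouped)
```

Keying the dictionary by style makes the rule structural: a style cannot appear twice as a key, so the only check needed is for a conflicting reassignment.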
A style-based content structuring system comprises a structured keyword building module, a style-keyword correspondence module, and a parsing and extraction module.
The structured keyword building module sets the structured keywords and determines the relationships between them.
The style-keyword correspondence module establishes the correspondence between styles and structured keywords.
The parsing and extraction module parses the document to be structured and extracts its content to form structured content.
In operation, the structured keyword building module first sets the structured keywords and determines the relationships between them. The style-keyword correspondence module then establishes the correspondence between styles and structured keywords. Finally, the parsing and extraction module reads and parses the document to be structured and, following that correspondence, extracts the matching document content into the structured keywords, forming the structured content and completing the process.
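The division of labor between the three modules might be sketched as follows. This is a minimal sketch, not the patented implementation: the class names, and the assumption that a parsed document is available as a list of `(style, text)` runs, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordModule:
    """Holds the structured keywords and each keyword's parent."""
    parents: dict = field(default_factory=dict)  # keyword -> parent or None

    def add(self, keyword, parent=None):
        self.parents[keyword] = parent


@dataclass
class MappingModule:
    """Holds the style -> keyword correspondence."""
    style_to_keyword: dict = field(default_factory=dict)

    def bind(self, style, keyword):
        self.style_to_keyword[style] = keyword


class ExtractionModule:
    """Walks the styled runs and files each one under its keyword."""

    def run(self, runs, keywords, mapping):
        structured = []
        for style, text in runs:
            keyword = mapping.style_to_keyword.get(style)
            # Runs whose style has no known keyword are skipped.
            if keyword in keywords.parents:
                structured.append((keyword, text))
        return structured
```

The extraction module only consumes what the other two modules produced, which matches the order of operation described above: keywords first, then the mapping, then parsing.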
Effect of the invention: for content that carries style and layout information, the method and system not only structure the content automatically but also let the structured content retain the original style and layout information, which serves the needs of different users.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a structural diagram of the system of the invention.
Detailed description
The invention is further described below with reference to an embodiment and the accompanying drawings.
As shown in Fig. 2, a style-based content structuring system comprises a structured keyword building module, a style-keyword correspondence module, and a parsing and extraction module.
The structured keyword building module sets the structured keywords and determines the relationships between them.
The style-keyword correspondence module establishes the correspondence between styles and structured keywords.
The parsing and extraction module parses the document to be structured and extracts its content to form structured content.
In operation, the structured keyword building module first sets the structured keywords and determines the relationships between them. The style-keyword correspondence module then establishes the correspondence between styles and structured keywords. Finally, the parsing and extraction module reads and parses the document to be structured and, following that correspondence, extracts the matching document content into the structured keywords, forming the structured content and completing the process.
To go with this system, the invention uses a style-based content structuring method which, as shown in Fig. 1, comprises the following steps:
(1) Establish a content structuring scheme: set structured keywords as required and determine the relationships between them. Keyword setting is flexible: keywords can be set according to the content structure of the document, as needed or according to the user's habits, or according to the style names of the document content. The relationships between keywords are determined at the same time from the styles of the document content. These relationships cover position, order, hierarchy, structure, and containment; that is, they mirror the actual relationships in the document between the pieces of content that the keywords represent.
In this embodiment, the step is illustrated by structuring the content of the following prepress typeset document:
Based on the styles used in the document content, its styles and their attributes are marked as follows:
Figure A200810238994D00091
Before the styled document content is structured, the content structuring scheme is built and the structured keywords are set. Because this document uses many styles, the keywords in this embodiment are set according to each style in the document content, as analyzed below.
The document contains one headline, three subheadings, a bulleted list, a figure caption, a table, and some body text, each using a different style. The styles fall into two classes: text styles, such as the styles for the headline, subheadings, bulleted list, and body text; and object styles, such as the styles for the figure caption and the table. Setting the structured keywords according to style gives the following result:
Headline
Subheading
Body text
List item
Figure caption
Table
When the structured keywords are set, the relationships between them must also be determined from the styles of the document content. These relationships cover position, order, hierarchy, structure, and containment; that is, they mirror the actual relationships in the document between the pieces of content that the keywords represent. In this embodiment, analysis shows that: 1) the entire document is the root element; 2) the headline and subheadings are child elements of the root; 3) the body text is a child element of the root; 4) the bulleted list, picture, and table are elements at the same level as the body text; 5) a list item is a child element of the bulleted list. This analysis determines the relationships between the structured keywords from the styles of the document content.
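The five relationships found in this analysis form a small tree. One way to hold such a tree, offered here only as an illustrative sketch, is a parent table; the identifier spellings (`document`, `bulleted_list`, and so on) and the helper names are assumptions, not names from the patent.

```python
# Parent table for the embodiment's keywords (identifier spellings are
# hypothetical). None marks the root element.
KEYWORD_PARENT = {
    "document": None,
    "headline": "document",
    "subheading": "document",
    "body_text": "document",
    "bulleted_list": "document",   # same level as body text
    "picture": "document",         # same level as body text
    "table": "document",           # same level as body text
    "list_item": "bulleted_list",
}


def path_to_root(keyword, parents=KEYWORD_PARENT):
    """Return the chain of keywords from the root down to `keyword`."""
    path = [keyword]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return list(reversed(path))


def children(keyword, parents=KEYWORD_PARENT):
    """Return the direct child keywords of `keyword`."""
    return [k for k, p in parents.items() if p == keyword]
```

For example, `path_to_root("list_item")` walks from the list item through the bulleted list up to the document root.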
(2) Establish the correspondence between styles and structured keywords.
When establishing the correspondence, one structured keyword may correspond to one or more styles, but one style can correspond to only one structured keyword. In this embodiment each style has a unique structured keyword, and the attributes of the corresponding style are recorded. The correspondence is shown in the table below:
Structured keyword    Style
Headline              Headline style
Subheading            Subheading style
Body text             Body text style
List item             Bulleted list style
Figure caption        Figure caption style
Table                 Table style
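The table above, together with the recorded style attributes, could be held in a structure like the one sketched below. The attribute values are placeholders invented for illustration, because the source figure listing the real attributes is not reproduced here; the names `StyleEntry`, `STYLE_TABLE`, `keyword_for`, and `styles_for` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StyleEntry:
    keyword: str                 # the single keyword this style maps to
    attributes: Dict[str, str]   # recorded style attributes


# Attribute values below are placeholders, not values from the patent.
STYLE_TABLE: Dict[str, StyleEntry] = {
    "Headline style": StyleEntry("headline", {"font-size": "placeholder"}),
    "Subheading style": StyleEntry("subheading", {"font-size": "placeholder"}),
    "Body text style": StyleEntry("body_text", {"font-size": "placeholder"}),
    "Bulleted list style": StyleEntry("list_item", {"marker": "placeholder"}),
    "Figure caption style": StyleEntry("figure_caption", {"align": "placeholder"}),
    "Table style": StyleEntry("table", {"border": "placeholder"}),
}


def keyword_for(style: str) -> Optional[str]:
    """Look up the keyword for a style, or None if the style is unknown."""
    entry = STYLE_TABLE.get(style)
    return entry.keyword if entry else None


def styles_for(keyword: str) -> List[str]:
    """List every style bound to a keyword (one or more are allowed)."""
    return [s for s, e in STYLE_TABLE.items() if e.keyword == keyword]
```

Storing the attributes next to the keyword is what lets a consumer later reapply the original styling to the structured content.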
(3) Parse the document to be structured and extract its content to form structured content.
In this embodiment the document is parsed and, following the correspondence between styles and structured keywords from step (2), its content is extracted to form structured content. The process is as follows:
1) first find the headline and extract its content into the structured keyword "Headline", completing the structuring of the headline;
2) find the subheadings and extract their content into the structured keyword "Subheading", completing the structuring of the subheadings;
3) find the body text, bulleted list, picture, table, and so on, and extract their content into the corresponding structured keywords, completing the structuring of the body.
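The ordered passes above can be sketched as one loop per keyword over the styled runs of the document. This is a hedged illustration: the representation of the parsed document as `(style, text)` pairs and the function name `extract_in_order` are assumptions made for the example.

```python
def extract_in_order(runs, style_to_keyword, order):
    """Collect text under each keyword, one pass per keyword.

    runs: list of (style, text) pairs in document order (assumed form).
    style_to_keyword: dict mapping each style to its single keyword.
    order: keywords in the order they should be processed.
    """
    structured = {keyword: [] for keyword in order}
    for keyword in order:
        for style, text in runs:
            if style_to_keyword.get(style) == keyword:
                structured[keyword].append(text)
    return structured
```

Called with an order such as `["headline", "subheading", "body_text"]`, the function fills the headline first, then the subheadings, then the body, and keeps document order within each keyword.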
After the document in this embodiment has been structured, two files are produced: a style mapping file and a structured content file. The style mapping file records the correspondence between styles and structured keywords; the structured content file records the correspondence between structured keywords and document content. The style mapping file is as follows:
Figure A200810238994D00111
The structured content file is as follows:
Figure A200810238994D00121
After this processing, the document content in this embodiment has been structured. The result conforms fully to the keyword relationships determined in step (1), and the structured content can carry the original style information. In practice, a client that needs structured content without styles can select the structured content file alone; a client that needs structured content with styles selects both the style mapping file and the structured content file.
The result of the structuring can be expressed in whatever form the user needs: a file conforming to the XML standard, or a file in a user-defined format.
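If XML is chosen as the output form, the two files might be produced along the following lines with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`styleMap`, `style`, `name`, `keyword`, `document`) are assumptions for illustration; the patent's own file layouts appear only in figures that are not reproduced here.

```python
import xml.etree.ElementTree as ET


def style_map_xml(style_to_keyword):
    """Serialize the style -> keyword correspondence (hypothetical schema)."""
    root = ET.Element("styleMap")
    for style, keyword in style_to_keyword.items():
        ET.SubElement(root, "style", name=style, keyword=keyword)
    return ET.tostring(root, encoding="unicode")


def content_xml(items):
    """Serialize (keyword, text) pairs, one element per piece of content.

    Keywords are used as tag names, so they must be valid XML names.
    """
    root = ET.Element("document")
    for keyword, text in items:
        ET.SubElement(root, keyword).text = text
    return ET.tostring(root, encoding="unicode")
```

Keeping the two serializers separate mirrors the choice described above: a consumer that does not need styles reads only the content output, while one that does reads both.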
The method of the invention is not limited to the embodiment described above; other embodiments that those skilled in the art derive from the technical solution of the invention likewise fall within its scope of technical innovation.

Claims (7)

1. A style-based content structuring method, comprising the following steps: (1) establishing a content structuring scheme, setting structured keywords as required, and determining the relationships between the structured keywords; (2) establishing the correspondence between styles and structured keywords; (3) parsing the document to be structured and extracting its content to form structured content.
2. The style-based content structuring method of claim 1, characterized in that: in step (1), the structured keywords are set according to the content structure of the document to be structured.
3. The style-based content structuring method of claim 1, characterized in that: in step (1), the structured keywords are set according to the styles of the document content.
4. The style-based content structuring method of any one of claims 1 to 3, characterized in that: in step (1), the relationships between the structured keywords are determined according to the styles of the document content.
5. The style-based content structuring method of claim 4, characterized in that: in step (2), when the correspondence between styles and structured keywords is established, one structured keyword corresponds to one or more styles, but one style can correspond to only one structured keyword.
6. The style-based content structuring method of claim 5, characterized in that: in step (3), after the document has been structured, two files are formed: a style mapping file and a structured content file; the style mapping file records the correspondence between styles and structured keywords, and the structured content file records the correspondence between structured keywords and document content.
7. A style-based content structuring system, comprising: a structured keyword building module, a style-keyword correspondence module, and a parsing and extraction module; the structured keyword building module is used to set structured keywords and determine the relationships between them; the style-keyword correspondence module is used to establish the correspondence between styles and structured keywords; the parsing and extraction module is used to parse the document to be structured and extract its content to form structured content; in operation, the structured keyword building module first sets the structured keywords and determines the relationships between them; the style-keyword correspondence module then establishes the correspondence between styles and structured keywords; the parsing and extraction module then reads and parses the document to be structured and, according to the correspondence established by the style-keyword correspondence module, extracts the corresponding document content into the structured keywords, thereby forming the structured content and ending the process.
CN2008102389945A 2008-12-08 2008-12-08 A style-based content structured processing method and system Expired - Fee Related CN101430714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102389945A CN101430714B (en) 2008-12-08 2008-12-08 A style-based content structured processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008102389945A CN101430714B (en) 2008-12-08 2008-12-08 A style-based content structured processing method and system

Publications (2)

Publication Number Publication Date
CN101430714A true CN101430714A (en) 2009-05-13
CN101430714B CN101430714B (en) 2011-01-26

Family

ID=40646108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102389945A Expired - Fee Related CN101430714B (en) 2008-12-08 2008-12-08 A style-based content structured processing method and system

Country Status (1)

Country Link
CN (1) CN101430714B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122280A (en) * 2009-12-17 2011-07-13 北大方正集团有限公司 Method and system for intelligently extracting content object
CN102200966A (en) * 2011-06-01 2011-09-28 潍坊北大青鸟华光照排有限公司 Method for extracting and processing layout information
CN102799597A (en) * 2011-05-26 2012-11-28 株式会社日立制作所 Content extraction method
CN102103605B (en) * 2009-12-18 2012-12-19 北大方正集团有限公司 Method and system for intelligently extracting document structure
CN102841886A (en) * 2011-06-21 2012-12-26 北大方正集团有限公司 Method and device for splitting document
CN102855243A (en) * 2011-06-28 2013-01-02 北大方正集团有限公司 Method and device for extracting document structure
CN103136184A (en) * 2011-12-05 2013-06-05 北大方正集团有限公司 Automatic typesetting method and automatic typesetting device
CN103377183A (en) * 2012-04-27 2013-10-30 北大方正集团有限公司 Method and device for typesetting repeatedly
CN104657342A (en) * 2013-11-19 2015-05-27 北大方正集团有限公司 Clean proof generating method and device
CN103186514B (en) * 2011-12-31 2016-04-20 北大方正集团有限公司 For realizing the method and apparatus of file structure
CN111723555A (en) * 2020-06-22 2020-09-29 稿定(厦门)科技有限公司 Flat typesetting method and system

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102122280A (en) * 2009-12-17 2011-07-13 北大方正集团有限公司 Method and system for intelligently extracting content object
CN102122280B (en) * 2009-12-17 2013-06-05 北大方正集团有限公司 Method and system for intelligently extracting content object
CN102103605B (en) * 2009-12-18 2012-12-19 北大方正集团有限公司 Method and system for intelligently extracting document structure
CN102799597A (en) * 2011-05-26 2012-11-28 株式会社日立制作所 Content extraction method
CN102200966A (en) * 2011-06-01 2011-09-28 潍坊北大青鸟华光照排有限公司 Method for extracting and processing layout information
CN102841886A (en) * 2011-06-21 2012-12-26 北大方正集团有限公司 Method and device for splitting document
CN102855243A (en) * 2011-06-28 2013-01-02 北大方正集团有限公司 Method and device for extracting document structure
CN103136184A (en) * 2011-12-05 2013-06-05 北大方正集团有限公司 Automatic typesetting method and automatic typesetting device
CN103136184B (en) * 2011-12-05 2016-01-13 北大方正集团有限公司 A kind of method of automatic typesetting and device thereof
CN103186514B (en) * 2011-12-31 2016-04-20 北大方正集团有限公司 For realizing the method and apparatus of file structure
CN103377183A (en) * 2012-04-27 2013-10-30 北大方正集团有限公司 Method and device for typesetting repeatedly
CN103377183B (en) * 2012-04-27 2016-04-20 北大方正集团有限公司 Repeat the method and apparatus of typesetting
CN104657342A (en) * 2013-11-19 2015-05-27 北大方正集团有限公司 Clean proof generating method and device
CN111723555A (en) * 2020-06-22 2020-09-29 稿定(厦门)科技有限公司 Flat typesetting method and system

Also Published As

Publication number Publication date
CN101430714B (en) 2011-01-26

Similar Documents

Publication Publication Date Title
CN101430714B (en) A style-based content structured processing method and system
CN103631969B (en) A kind of generation method and device of report data
CN102591971B (en) Method and device for extracting webpage information
US7720885B2 (en) Generating a word-processing document from database content
CN104133772A (en) Automatic test data generation method
CN113407678B (en) Knowledge graph construction method, device and equipment
CN111309313A (en) A fast way to generate HTML and store form data
CN103823838A (en) Method for inputting and comparing multi-format documents
CN102122280A (en) Method and system for intelligently extracting content object
CN102566945A (en) Method and system for realizing automatic acquisition and on-demand printing of book
CN105808775A (en) Method and device for synchronizing layout file information into database
CN100338605C (en) Recording method for extendable mark language file repairing trace
CN110737432A (en) script aided design method and device based on root list
CN113741864B (en) Automatic semantic service interface design method and system based on natural language processing
US20250103771A1 (en) Systems and methods for automatically generating designs
CN102110006A (en) System and method for expanding and developing application business
US20070282804A1 (en) Apparatus and method for extracting database information from a report
CN108196921B (en) Document development method and device, computer equipment and storage medium
CN105912723A (en) Storage method of custom field
CN111142871B (en) A front-end page development system, method, device, and medium
CN113238865A (en) Method for quickly constructing knowledge graph based on Excel one-key import
CN108399188B (en) Universal establishing and processing method for strong service object based on type metadata
CN108228688B (en) An XBRL-based template generation method, system and server
US20080077641A1 (en) System and method for editing contract clauses in static web pages
CN105701158A (en) File system read-write optimization method and framework

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110126

Termination date: 20191208