Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a style-based content structuring processing method and system. The method and system not only automate content structuring, but also allow the structured content to retain information such as the original styles and formats.
To achieve the above object, the technical solution adopted by the present invention is a style-based content structuring processing method comprising the following steps:
(1) establishing a content structuring system, setting structuring keywords as required, and determining the relations between the structuring keywords;
(2) establishing the correspondence between styles and the structuring keywords;
(3) parsing the document to be structured, and extracting its content to form structured content.
Further, in step (1), the structuring keywords are set according to the content structure of the document to be structured.
Further, in step (1), the structuring keywords are set according to the styles of the document content.
Further, in step (1), the relations between the structuring keywords are determined according to the styles of the document content. The relations between the structuring keywords refer to relations such as position, arrangement, hierarchy, structure, and containment between the keywords, that is, the actual relations within the document between the pieces of content that the structuring keywords represent.
Further, in step (2), when the correspondence between styles and structuring keywords is established, one structuring keyword may correspond to one or more styles, but each style may correspond to only one structuring keyword.
Further, in step (3), after the document has been structured, two files are formed: a style mapping file and a structured content file. The style mapping file records the correspondence between styles and structuring keywords; the structured content file records the correspondence between structuring keywords and document content.
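The constraint in step (2), that a keyword may have several styles while a style belongs to exactly one keyword, can be illustrated with a minimal Python sketch. The function names and the style names used with them are hypothetical and do not come from the invention; the sketch only shows one way such a constraint could be checked.

```python
from collections import defaultdict


def build_style_mapping(pairs):
    """Build a style -> keyword mapping from (style, keyword) pairs.

    A keyword may appear with several styles, but a style that is
    assigned to two different keywords is rejected.
    """
    style_to_keyword = {}
    for style, keyword in pairs:
        existing = style_to_keyword.get(style)
        if existing is not None and existing != keyword:
            raise ValueError(
                f"style {style!r} already maps to {existing!r}, "
                f"cannot also map to {keyword!r}"
            )
        style_to_keyword[style] = keyword
    return style_to_keyword


def styles_by_keyword(style_to_keyword):
    """Invert the mapping: keyword -> sorted list of its styles."""
    grouped = defaultdict(list)
    for style, keyword in style_to_keyword.items():
        grouped[keyword].append(style)
    return {keyword: sorted(styles) for keyword, styles in grouped.items()}
```

Storing the mapping keyed by style makes the "one style, one keyword" rule a property of the data structure itself, while the inverse view is derived on demand.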
A style-based content structuring processing system comprises: a structuring keyword construction module, a style-and-keyword correspondence module, and a parsing and extraction module;
The structuring keyword construction module is used to set the structuring keywords and to determine the relations between the structuring keywords;
The style-and-keyword correspondence module is used to establish the correspondence between styles and the structuring keywords;
The parsing and extraction module is used to parse the document to be structured and to extract the document content to form structured content;
When the system operates, the structuring keyword construction module first sets the structuring keywords and determines the relations between them. The style-and-keyword correspondence module then establishes the correspondence between styles and structuring keywords. The parsing and extraction module then reads and parses the document to be structured and, according to the correspondence established by the style-and-keyword correspondence module, extracts the corresponding document content into the structuring keywords, thereby forming structured content, whereupon processing is complete.
The effect of the present invention is as follows: with the method and system of the present invention, when content carrying information such as styles and formats undergoes content structuring processing, the structuring is performed automatically and the structured content retains the original style and format information, which makes it much easier to meet the needs of different users.
Embodiment
The present invention is further described below with reference to the embodiment and the accompanying drawings:
As shown in Figure 2, a style-based content structuring processing system comprises: a structuring keyword construction module, a style-and-keyword correspondence module, and a parsing and extraction module;
The structuring keyword construction module is used to set the structuring keywords and to determine the relations between the structuring keywords;
The style-and-keyword correspondence module is used to establish the correspondence between styles and the structuring keywords;
The parsing and extraction module is used to parse the document to be structured and to extract the document content to form structured content;
When the system operates, the structuring keyword construction module first sets the structuring keywords and determines the relations between them. The style-and-keyword correspondence module then establishes the correspondence between styles and structuring keywords. The parsing and extraction module then reads and parses the document to be structured and, according to the correspondence established by the style-and-keyword correspondence module, extracts the corresponding document content into the structuring keywords, thereby forming structured content, whereupon processing is complete.
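The cooperation of the three modules can be sketched in Python as follows. This is an illustrative sketch only: the class names, the representation of a parsed document as a list of (style, text) runs, and the choice to skip runs whose style has no keyword are all assumptions, not details given in the invention.

```python
class KeywordBuilder:
    """Structuring keyword construction module (hypothetical sketch)."""

    def __init__(self):
        self.parent_of = {}  # keyword -> parent keyword (None for the root)

    def add_keyword(self, keyword, parent=None):
        self.parent_of[keyword] = parent


class StyleKeywordMapper:
    """Style-and-keyword correspondence module (hypothetical sketch)."""

    def __init__(self, builder):
        self.builder = builder
        self.style_to_keyword = {}

    def map_style(self, style, keyword):
        if keyword not in self.builder.parent_of:
            raise KeyError(f"unknown keyword: {keyword!r}")
        current = self.style_to_keyword.get(style)
        if current is not None and current != keyword:
            raise ValueError(f"style {style!r} already maps to {current!r}")
        self.style_to_keyword[style] = keyword


class ParserExtractor:
    """Parsing and extraction module (hypothetical sketch)."""

    def __init__(self, mapper):
        self.mapper = mapper

    def extract(self, styled_runs):
        """Turn (style, text) runs into (keyword, text) records."""
        records = []
        for style, text in styled_runs:
            keyword = self.mapper.style_to_keyword.get(style)
            if keyword is not None:  # runs with unmapped styles are skipped
                records.append((keyword, text))
        return records
```

Each module depends only on the one before it, which mirrors the order of operation described above: keywords first, then the style correspondence, then parsing and extraction.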
Corresponding to the above system, the present invention adopts a style-based content structuring processing method which, as shown in Figure 1, specifically comprises the following steps:
(1) Establishing a content structuring system, setting structuring keywords as required, and determining the relations between the structuring keywords. The setting of structuring keywords is flexible: they may be set according to the content structure of the document, as required or according to the user's habits, or they may be set according to the style names of the document content. At the same time, the relations between the structuring keywords are determined according to the styles of the document content. The relations between the structuring keywords refer to relations such as position, arrangement, hierarchy, structure, and containment between the keywords, that is, the actual relations within the document between the pieces of content that the structuring keywords represent;
In this embodiment, the implementation of this step is described in detail by taking as an example the content structuring processing of the following prepress typeset document:
According to the styles in the above document content, its specific styles and their attributes are marked as follows:
Before the above styled document content is structured, the content structuring system is first constructed and the structuring keywords are set. Because the content of this document uses many styles, in this embodiment the structuring keywords are set according to the individual styles in the document content. The specific analysis is as follows:
The above document contains one headline, three subheads, a bulleted list, a figure caption, one table, and some body text, and each kind of content uses a different style. The styles fall into two classes: one class is text styles, such as the style of the headline, the style of the subheads, the style of the bulleted list, and the style of the text; the other class is object styles, such as the style of the figure caption and the style of the table. Structuring keywords are set according to the styles, with the result shown in the following table:
Headline
Subhead
Text
List item
Figure caption
Table
When the structuring keywords are set, the relations between them must also be determined according to the styles of the document content. The relations between the structuring keywords refer to relations such as position, arrangement, hierarchy, structure, and containment between the keywords, that is, the actual relations within the document between the pieces of content that the structuring keywords represent. In this embodiment, analysis shows that: 1) the entire document is the root element; 2) the headline and the subheads are child elements of the root element; 3) the text is a child element of the root element; 4) the bulleted list, picture, and table are elements at the same level as the text; 5) a list item is a child element of the bulleted list. On the basis of this analysis, the relations between the structuring keywords are determined according to the styles of the document content.
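The hierarchy found in this analysis can be written down as a parent table, as in the following Python sketch. The keyword spellings and helper names are illustrative assumptions, and "at the same level as the text" is interpreted here as meaning that the bulleted list, picture, and table are also children of the root.

```python
# Parent of each structuring keyword; None marks the root element.
PARENT = {
    "document": None,
    "headline": "document",
    "subhead": "document",
    "text": "document",
    "bulleted list": "document",
    "picture": "document",
    "table": "document",
    "list item": "bulleted list",
}


def depth(keyword):
    """Number of steps from the keyword up to the root element."""
    steps = 0
    while PARENT[keyword] is not None:
        keyword = PARENT[keyword]
        steps += 1
    return steps


def children(keyword):
    """Keywords whose direct parent is the given keyword."""
    return [k for k, parent in PARENT.items() if parent == keyword]
```

For example, `depth("list item")` walks from the list item through the bulleted list to the document, and `children("bulleted list")` returns the list items it contains.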
(2) Establishing the correspondence between styles and structuring keywords;
When the correspondence between styles and structuring keywords is established, one structuring keyword may correspond to one style or to several (two or more) styles, but each style may correspond to only one structuring keyword. Specifically, in this embodiment, each style has a unique structuring keyword corresponding to it, and the attributes of the corresponding style are recorded. The specific correspondence is shown in the following table:
Structuring keyword | Style
Headline | Headline style
Subhead | Subhead style
Text | Text style
List item | Bulleted-list style
Figure caption | Figure-caption style
Table | Table style
(3) Parsing the document to be structured, and extracting its content to form structured content;
In this embodiment, the document is parsed and, according to the correspondence between styles and structuring keywords established in step (2), the content of the document is extracted to form structured content. The detailed process is as follows:
1) first, the headline is located, and its content is extracted and structured into the structuring keyword "Headline", completing the structuring of the headline;
2) the subheads are located, and their content is extracted and structured into the structuring keyword "Subhead", completing the structuring of the subheads;
3) the text, bulleted list, picture, table, and so on are located, and their content is extracted and structured into the corresponding structuring keywords, completing the structuring of the text.
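The extraction described in the three sub-steps above can be sketched in Python. The sketch makes several assumptions that are not part of the embodiment: the parsed document is represented as a list of (style, content) runs, the style names are placeholders, content with an unmapped style is skipped, and a single pass in document order replaces the headline, subhead, and remaining-content passes.

```python
# Placeholder style names standing in for the styles of the embodiment.
STYLE_TO_KEYWORD = {
    "Headline style": "headline",
    "Subhead style": "subhead",
    "Text style": "text",
    "Bulleted-list style": "list item",
    "Figure-caption style": "figure caption",
    "Table style": "table",
}


def structure(runs):
    """Return (styles_used, content) for a list of (style, content) runs.

    styles_used plays the role of the style mapping data and content
    plays the role of the structured content data.
    """
    styles_used = {}
    content = []
    for style, text in runs:
        keyword = STYLE_TO_KEYWORD.get(style)
        if keyword is None:
            continue  # assumption: unmapped styles are ignored
        styles_used[style] = keyword
        content.append({"keyword": keyword, "content": text})
    return styles_used, content
```

Returning the two results separately keeps style information apart from the keyword-to-content records, so either can be used without the other.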
After the document in the above embodiment has been structured, two files are formed: a style mapping file and a structured content file. The style mapping file records the correspondence between styles and structuring keywords; the structured content file records the correspondence between structuring keywords and document content. The style mapping file is as follows:
The structured content file is as follows:
Through the above processing, the document content in this embodiment has undergone structuring processing. The result of the content structuring fully conforms to the relations between the structuring keywords determined in step (1), and the structured content can carry the original style information. In a specific application, if the client needs structured content without styles, only the structured content file need be selected; if the client needs structured content with styles, the style mapping file and the structured content file are selected together.
The result of the above content structuring processing may be expressed in whatever form the user requires: it may be a file conforming to the XML standard, or a file in a user-defined format.
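As one possible XML form of the result, the following Python sketch serializes structured content with or without style information. The element names, the `style` attribute, and the function name are hypothetical, and the sketch assumes the one-to-one case of the embodiment, in which each keyword has a single style.

```python
import xml.etree.ElementTree as ET


def to_xml(content, style_to_keyword=None):
    """Serialize (keyword, text) records as XML.

    If a style mapping is supplied, each element also carries the name
    of the style mapped to its keyword (one-to-one case assumed).
    """
    keyword_to_style = {}
    if style_to_keyword is not None:
        keyword_to_style = {k: s for s, k in style_to_keyword.items()}
    root = ET.Element("document")
    for keyword, text in content:
        elem = ET.SubElement(root, keyword)
        elem.text = text
        style = keyword_to_style.get(keyword)
        if style is not None:
            elem.set("style", style)
    return ET.tostring(root, encoding="unicode")
```

Calling `to_xml(content)` corresponds to selecting only the structured content file, while passing the style mapping as the second argument corresponds to selecting both files together.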
The method of the present invention is not limited to the embodiment described in the detailed description; other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of technical innovation of the present invention.