CN102200966A - Method for extracting and processing layout information - Google Patents
- Publication number
- CN102200966A (application number CN2011101458507A)
- Authority
- CN
- China
- Prior art keywords
- layout information
- information
- processing
- layout
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for extracting and processing layout information. The method comprises the steps of extracting layout information, processing the layout information, and outputting the layout information of manuscripts appearing in a newspaper. With the method provided by the invention, layout files of different formats, such as PS (PostScript), S2, and PDF (Portable Document Format), can be processed during extraction; multiple file types can be processed in a single window; information such as the article author, article introduction, article theme, and article subtitle can be extracted automatically; and, during processing, intelligently combining text partitions with image partitions ensures an accurate correspondence between text and images, so that the accuracy of layout information extraction is improved, information processing is faster, and work efficiency is greatly increased.
Description
Technical field
The present invention relates to the field of Chinese information processing technology in computer applications, and in particular to a method for extracting and processing layout information.
Background art
At present, the layout of digital newspapers, magazines, and similar publications is produced entirely by computer through steps such as text entry, typesetting, and display. However, existing approaches to extracting and processing the layout information of such publications can handle only a single type of layout file. When a layout file with incomplete partition information is encountered, it is generally processed manually, which makes it difficult to fully recover the layout information required for the published manuscripts.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for extracting and processing layout information that can handle different types of layout files, improves the accuracy of layout information extraction and the speed of information processing, and increases work efficiency.
To solve the above technical problem, the technical solution of the present invention is a method for extracting and processing layout information, comprising the following steps:
(1) Extraction of layout information: first, obtain the required layout file from a data source; then analyze the layout file to determine its type; apply a different method for each file type to analyze the file and extract the layout information, which comprises text information and picture information; and convert the layout information into a unified format;
(2) Processing of layout information: first, divide the text information and picture information into partitions, and intelligently combine the text partitions with the picture partitions according to their attributes; manually associate any special partitions that were not associated automatically; and process the content and format of the layout information;
(3) Output the processed layout information as a structured document.
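The three steps above can be read as a simple pipeline in which each stage consumes the output of the previous one. The following is a minimal sketch of that control flow only; the names `LayoutInfo`, `extract`, `process`, `export`, and `run_pipeline` are hypothetical, and the stage bodies are placeholders rather than the patented method.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class LayoutInfo:
    """Hypothetical container for layout information in a unified format."""
    file_type: str
    text_items: List[str] = field(default_factory=list)
    image_items: List[str] = field(default_factory=list)
    articles: List[dict] = field(default_factory=list)


def extract(path: str) -> LayoutInfo:
    # Step (1): placeholder for type detection and type-specific extraction.
    # Here the type is guessed from the file extension only (an assumption).
    file_type = path.rsplit(".", 1)[-1].lower() if "." in path else "unknown"
    return LayoutInfo(file_type=file_type,
                      text_items=["headline", "body"],   # dummy data
                      image_items=["photo1"])            # dummy data


def process(info: LayoutInfo) -> LayoutInfo:
    # Step (2): placeholder for partitioning and text/picture combination.
    info.articles = [{"text": info.text_items, "images": info.image_items}]
    return info


def export(info: LayoutInfo) -> str:
    # Step (3): placeholder for structured-document output.
    return f"<layout type='{info.file_type}' articles='{len(info.articles)}'/>"


def run_pipeline(path: str,
                 stages: Sequence[Callable] = (extract, process, export)):
    """Feed the result of each stage into the next one."""
    result = path
    for stage in stages:
        result = stage(result)
    return result
```

Keeping the stages as separate functions mirrors the separation between extraction, processing, and output in the text, so any one stage could be replaced without touching the others.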
By adopting the above technical solution, the invention has the following beneficial effects: during extraction, different types of layout files, such as PS, S2, and PDF, can be handled, and multiple file types can be processed in a single window; during processing, intelligently combining the text partitions with the picture partitions ensures an accurate correspondence between text and pictures, which improves the accuracy of layout information extraction and the speed of information processing, and greatly increases work efficiency.
Description of drawings
The present invention is further described below with reference to the drawings and an embodiment.
Fig. 1 is a schematic block diagram of an embodiment of the invention;
Fig. 2 is a block diagram of the layout information processing in the embodiment of the invention.
Embodiment
As shown in Fig. 1, the method for extracting and processing layout information according to the present invention comprises:
(1) Extraction of layout information: first, obtain the required layout file from a data source; then analyze and classify the layout file to determine its type, for example a PS file, an S2, S72, or S92 file, or a PDF file; apply a different method for each file type, for example a PS plug-in, an S2 plug-in, or a PDF plug-in respectively, to analyze the file and extract the layout information content. The layout information content comprises text information and picture information: the text information includes the text content, the text attributes, and the position of the text on the page; the picture information includes the picture name, the picture size, and the position of the picture on the page. The layout information is then converted into a unified format for use in the processing step;
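The type detection and per-type plug-in dispatch described in step (1) might be organized as below. This is a sketch under stated assumptions: the record types, `detect_type`, `register`, and `extract` are hypothetical names, and the plug-in bodies are stubs because the source does not describe how each format is parsed. PDF and PostScript files are recognized by their standard leading bytes (`%PDF-` and `%!PS`); for S2, S72, and S92 files the sketch falls back to the file extension, which is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Tuple


@dataclass
class TextItem:
    content: str                    # text content
    attributes: Dict[str, str]      # text attributes, e.g. font or size
    position: Tuple[float, float]   # position on the page


@dataclass
class ImageItem:
    name: str                       # picture name
    size: Tuple[int, int]           # picture size
    position: Tuple[float, float]   # position on the page


@dataclass
class LayoutInfo:
    """Unified format shared by all plug-ins (hypothetical)."""
    file_type: str
    texts: List[TextItem] = field(default_factory=list)
    images: List[ImageItem] = field(default_factory=list)


def detect_type(name: str, head: bytes) -> str:
    """Classify a layout file from its first bytes, then its extension."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"%!PS"):
        return "ps"
    ext = Path(name).suffix.lower().lstrip(".")
    if ext in {"s2", "s72", "s92"}:   # assumption: extension identifies these
        return "s2"
    raise ValueError(f"unsupported layout file: {name}")


PLUGINS: Dict[str, Callable[[bytes], LayoutInfo]] = {}


def register(file_type: str):
    """Decorator that registers one extraction plug-in per file type."""
    def wrap(func: Callable[[bytes], LayoutInfo]):
        PLUGINS[file_type] = func
        return func
    return wrap


@register("pdf")
def pdf_plugin(data: bytes) -> LayoutInfo:
    return LayoutInfo("pdf")   # stub: real PDF parsing is out of scope


@register("ps")
def ps_plugin(data: bytes) -> LayoutInfo:
    return LayoutInfo("ps")    # stub: real PostScript parsing is out of scope


@register("s2")
def s2_plugin(data: bytes) -> LayoutInfo:
    return LayoutInfo("s2")    # stub: real S2/S72/S92 parsing is out of scope


def extract(name: str, data: bytes) -> LayoutInfo:
    """Detect the file type and dispatch to the matching plug-in."""
    return PLUGINS[detect_type(name, data[:16])](data)
```

Because every plug-in returns the same `LayoutInfo` structure, the later processing step does not need to know which file type the data came from.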
As shown in Fig. 2, the method for extracting and processing layout information according to the present invention further comprises:
(2) Processing of layout information: first, import the unified-format text information and picture information from step (1) into the data processing and sorting system and divide them into partitions; then intelligently combine the text partitions with the picture partitions according to their attributes, so that the different published manuscripts are formed; any special partitions that were not associated automatically must be associated manually; finally, according to the requirements for the published data, process the content and format of the layout information of each manuscript;
(3) Finally, export a structured document containing the complete layout information.
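The combination of text partitions and picture partitions in step (2) can be sketched as follows. The source only says the partitions are combined "according to their attributes"; the rule used here, attaching each picture to the nearest text partition within a distance threshold and queuing the rest for manual association, is an assumption chosen for illustration. `Partition`, `Article`, `gap`, `combine`, and `max_gap` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Partition:
    pid: str
    kind: str   # "text" or "image"
    box: Box    # bounding box on the page


@dataclass
class Article:
    text: Partition
    images: List[Partition] = field(default_factory=list)


def gap(a: Box, b: Box) -> float:
    """Distance between two boxes; 0.0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def combine(partitions: List[Partition], max_gap: float = 20.0):
    """Attach each picture to its nearest text partition.

    Returns (articles, manual): pictures farther than max_gap from every
    text partition are left in `manual` for manual association.
    """
    articles = {p.pid: Article(p) for p in partitions if p.kind == "text"}
    manual: List[Partition] = []
    for img in (p for p in partitions if p.kind == "image"):
        best = min(articles.values(),
                   key=lambda art: gap(art.text.box, img.box),
                   default=None)
        if best is not None and gap(best.text.box, img.box) <= max_gap:
            best.images.append(img)
        else:
            manual.append(img)
    return list(articles.values()), manual
```

A picture placed directly under a column of text is attached to that column, while an isolated picture ends up in the manual queue, which corresponds to the "special partitions" handled by hand in the text.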
In summary, with the method of the present invention, different types of layout files, such as PS, S2, and PDF, can be handled during extraction, and multiple file types can be processed in a single window; information such as the article author, article eyebrow heading (introduction), article theme, and article subtitle can be extracted automatically; and, during processing, intelligently combining the text partitions with the picture partitions ensures an accurate correspondence between text and pictures, which improves the accuracy of layout information extraction and the speed of information processing, and greatly increases work efficiency.
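As an illustration of the structured-document output, the sketch below serializes articles with the fields named above (author, eyebrow heading, theme, subtitle) to XML using Python's standard `xml.etree.ElementTree`. The source does not specify the output format or schema, so XML, the element names, and the function `export_articles` are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical field names, following the items listed in the text.
FIELDS = ("author", "eyebrow", "theme", "subtitle", "body")


def export_articles(articles: List[Dict], page_id: str) -> str:
    """Serialize the processed articles of one page as an XML string."""
    root = ET.Element("page", id=page_id)
    for art in articles:
        node = ET.SubElement(root, "article")
        for key in FIELDS:
            value = art.get(key)
            if value:  # skip fields that were not extracted
                ET.SubElement(node, key).text = value
        for name in art.get("images", []):
            ET.SubElement(node, "image", name=name)
    return ET.tostring(root, encoding="unicode")
```

A call such as `export_articles([{"author": "A", "theme": "T", "images": ["p.jpg"]}], "A1")` yields one `article` element containing `author`, `theme`, and `image` children; fields that are absent are simply omitted.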
The above is an example of the best mode for carrying out the invention; the parts not described in detail are common knowledge to those of ordinary skill in the art. The scope of protection of the present invention is defined by the claims, and any equivalent transformation made on the basis of the technical teaching of the present invention also falls within that scope.
Claims (1)
1. A method for extracting and processing layout information, characterized in that it comprises the following steps:
(1) Extraction of layout information: first, obtain the required layout file from a data source; then analyze the layout file to determine its type; apply a different method for each file type to analyze the file and extract the layout information, which comprises text information and picture information; and convert the layout information into a unified format;
(2) Processing of layout information: first, divide the text information and picture information into partitions, and intelligently combine the text partitions with the picture partitions according to their attributes; manually associate any partitions that were not associated automatically; and process the content and format of the layout information;
(3) Output the processed layout information as a structured document.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011101458507A CN102200966A (en) | 2011-06-01 | 2011-06-01 | Method for extracting and processing layout information |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011101458507A CN102200966A (en) | 2011-06-01 | 2011-06-01 | Method for extracting and processing layout information |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN102200966A (en) | 2011-09-28 |
Family
ID=44661652
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011101458507A Pending CN102200966A (en) | 2011-06-01 | 2011-06-01 | Method for extracting and processing layout information |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN102200966A (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2002117035A (en) * | 2000-10-10 | 2002-04-19 | Citation Japan:Kk | Device and method for analysis using free word and storage medium |
| CN101430714A (en) * | 2008-12-08 | 2009-05-13 | 北大方正集团有限公司 | Content structuring process method and system based on model |
| US20100329577A1 (en) * | 2009-06-24 | 2010-12-30 | Fuji Xerox Co., Ltd. | Image processing device |
Non-Patent Citations (1)
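| Title |
|---|
| 孟奇刚 (Meng Qigang), "信息管理新天地" (A New World of Information Management), 《软件世界》 (Software World), No. 10, 2002-10-31, full text, relevant to claim 1 * |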
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105095297A (en) * | 2014-05-16 | 2015-11-25 | 北大方正集团有限公司 | Method and apparatus for automatically generating picture code list |
| CN104008180A (en) * | 2014-06-09 | 2014-08-27 | 北京奇虎科技有限公司 | Association method of structural data with picture, association device thereof |
| CN105530276A (en) * | 2014-09-30 | 2016-04-27 | 北大方正集团有限公司 | Method and system for processing reported data |
| CN109086327A (en) * | 2018-07-03 | 2018-12-25 | 中国科学院信息工程研究所 | A kind of method and device quickly generating webpage visual structure graph |
| CN109086327B (en) * | 2018-07-03 | 2022-05-17 | 中国科学院信息工程研究所 | Method and device for rapidly generating webpage visual structure graph |
| CN114973286A (en) * | 2022-06-16 | 2022-08-30 | 科大讯飞股份有限公司 | Document element extraction method, device, equipment and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111832403B | | Document structure recognition method, document structure recognition model training method and device |
| CN113642584B | | Character recognition method, device, equipment, storage medium and intelligent dictionary pen |
| CN104391881B | | A log parsing method and system based on word segmentation algorithm |
| CN104598577B | | A kind of extracting method of Web page text |
| CN109685052A | | Method for processing text images, device, electronic equipment and computer-readable medium |
| US20150095769A1 | | Layout Analysis Method And System |
| CN102200966A | | Method for extracting and processing layout information |
| CN102253979A | | Vision-based web page extracting method |
| CN101881999A | | Oracle Video Input System and Implementation Method |
| CN103902918B | | Method and device for rapidly extracting text from Word document |
| Talukder et al. | | Connected component based approach for text extraction from color image |
| CN109685061A | | The recognition methods of mathematical formulae suitable for structuring |
| CN103440239A | | Functional region recognition-based webpage segmentation method and device |
| CN105426355A | | Syllabic size based method and apparatus for identifying Tibetan syntax chunk |
| Clausner et al. | | Efficient ocr training data generation with aletheia |
| Devi et al. | | Embedded optical character recognition on Tamil text image using Raspberry Pi |
| CN102236658B | | Webpage content extracting method and device |
| CN104820962A | | Method for generating and printing watermarks capable of replacing manual signatures |
| CN109447015A | | A kind of method and device handling form Image center selection word |
| CN116958996A | | OCR information extraction method, system and equipment |
| CN112541505B | | Text recognition method, text recognition device and computer-readable storage medium |
| CN120068810A | | Multi-mode document structuring processing method, device, equipment and medium |
| CN106156314B | | A kind of data manipulation method and device, data search method and device |
| CN104598289A | | Recognition method and electronic device |
| CN103678284A | | Method and device for translating page characters |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20110928 |