CN102200966A - Method for extracting and processing layout information - Google Patents

Info

Publication number
CN102200966A
CN102200966A CN2011101458507A CN201110145850A CN102200966A CN 102200966 A CN102200966 A CN 102200966A CN 2011101458507 A CN2011101458507 A CN 2011101458507A CN 201110145850 A CN201110145850 A CN 201110145850A CN 102200966 A CN102200966 A CN 102200966A
Authority
CN
China
Prior art keywords
layout information
information
processing
layout
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101458507A
Other languages
Chinese (zh)
Inventor
殷建民
张东升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WEIFANG BEIDA JADE BIRD HUAGUANG IMAGESETTER CO Ltd
Original Assignee
WEIFANG BEIDA JADE BIRD HUAGUANG IMAGESETTER CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WEIFANG BEIDA JADE BIRD HUAGUANG IMAGESETTER CO Ltd
Priority to CN2011101458507A
Publication of CN102200966A

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for extracting and processing layout information. The method comprises the steps of extracting layout information, processing the layout information, and outputting the layout information of manuscripts published in a newspaper. With this method, layout files of different formats such as PS (PostScript), S2 and PDF (Portable Document Format) can be handled during extraction; multiple file types can be processed in a single window; information such as the article author, article introduction, article theme and article subtitle can be extracted automatically; and during processing, text partitions and picture partitions are intelligently combined so that text and pictures correspond accurately. This improves the accuracy of layout information extraction, speeds up information processing, and greatly increases work efficiency.

Description

A method for extracting and processing layout information
Technical field
The present invention relates to the field of Chinese information processing technology in computer applications, and specifically to a method for extracting and processing layout information.
Background art
At present, the layouts of digital newspapers, magazines and the like are all produced by computer through steps such as text entry, typesetting and display. However, existing extraction and processing of layout information for digital newspapers and magazines can handle only a single type of layout file. When a layout file with incomplete partition information is encountered, it is generally processed manually, so it is difficult to fully recover the layout information needed for a published manuscript.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for extracting and processing layout information that can handle different types of layout files, improve the accuracy of layout information extraction and the speed of information processing, and raise work efficiency.
To solve the above technical problem, the technical solution of the present invention is a method for extracting and processing layout information, comprising the following steps:
(1) Extraction of layout information: first obtain the required layout files from a data source; then analyze the layout files to determine their type; apply a different method for each file type to analyze and extract the layout information, which comprises text information and picture information; and convert the layout information into a unified format;
(2) Processing of layout information: first partition the text information and picture information, and intelligently combine the text partitions and picture partitions according to their attributes; manually associate any special partitions that were not associated automatically; and process the content and form of the layout information;
(3) Output the processed layout information as a structured document.
With the above technical solution, the beneficial effects of the invention are as follows. During extraction, the method can handle different types of layout files, such as PS, S2 and PDF, and multiple file types can be processed in a single window. During processing, text partitions and picture partitions are intelligently combined, which ensures an accurate correspondence between text and pictures, improves the accuracy of layout information extraction and the speed of information processing, and greatly increases work efficiency.
Description of drawings
The present invention is further described below with reference to the drawings and embodiments.
Fig. 1 is a schematic block diagram of an embodiment of the invention;
Fig. 2 is a block diagram of the layout information processing flow in the embodiment of the invention.
Embodiment
As shown in Fig. 1, the method for extracting and processing layout information according to the invention comprises:
(1) Extraction step of layout information: first obtain the required layout files from a data source; then analyze and classify the layout files to determine their type, for example PS files, S2, S72 or S92 files, or PDF files; apply a different method for each file type, for example a PS plug-in, an S2 plug-in or a PDF plug-in, to analyze and extract the layout information content, which comprises text information and picture information, where the text information includes the text content, the text attributes and the position of the text on the page, and the picture information includes the picture name, the picture size and the position of the picture on the page; the layout information is then converted into a unified format for use in the processing step;
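The extraction step can be illustrated with a minimal sketch, assuming that file types are recognized by extension and that each plug-in is a function returning a common record. The names `LayoutInfo`, `detect_type`, `PLUGINS` and `extract_layout` are hypothetical, the plug-in bodies are placeholders rather than real PS, S2 or PDF parsers, and routing S72 and S92 files to the S2 plug-in is an assumption that the source does not state.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Position and size on the page as (x, y, width, height); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class TextItem:
    content: str                 # text content
    attributes: Dict[str, str]   # text attributes, e.g. font or size
    box: Box                     # position on the page


@dataclass
class PictureItem:
    name: str                    # picture name
    size: Tuple[int, int]        # picture size
    box: Box                     # position on the page


@dataclass
class LayoutInfo:
    """Unified format shared by all plug-ins (hypothetical structure)."""
    source_type: str
    texts: List[TextItem] = field(default_factory=list)
    pictures: List[PictureItem] = field(default_factory=list)


def detect_type(path: str) -> str:
    """Classify a layout file by its extension (assumed rule)."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext == "ps":
        return "PS"
    if ext in {"s2", "s72", "s92"}:
        return "S2"  # assumption: one plug-in covers S2, S72 and S92
    if ext == "pdf":
        return "PDF"
    raise ValueError(f"unsupported layout file type: {path}")


def extract_ps(path: str) -> LayoutInfo:
    return LayoutInfo("PS")      # placeholder: a real plug-in would parse the file


def extract_s2(path: str) -> LayoutInfo:
    return LayoutInfo("S2")      # placeholder


def extract_pdf(path: str) -> LayoutInfo:
    return LayoutInfo("PDF")     # placeholder


# One extraction method per file type, selected after classification.
PLUGINS: Dict[str, Callable[[str], LayoutInfo]] = {
    "PS": extract_ps,
    "S2": extract_s2,
    "PDF": extract_pdf,
}


def extract_layout(path: str) -> LayoutInfo:
    """Detect the file type, then dispatch to the matching plug-in."""
    return PLUGINS[detect_type(path)](path)
```

Because every plug-in returns the same `LayoutInfo` record, the later processing step does not need to know which file type the data came from.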
As shown in Fig. 2, the method for extracting and processing layout information according to the invention further comprises:
(2) Processing step of layout information: the unified-format text information and picture information from step (1) are imported into a data processing and sorting system and first partitioned; the text partitions and picture partitions are then intelligently combined according to their attributes to form the individual published manuscripts; special partitions that were not associated automatically are associated manually; then, according to the needs of the published data, the content and form of the layout information of each manuscript are processed;
(3) Finally, a structured document containing the complete layout information is exported.
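Steps (2) and (3) can be sketched as follows. The source does not say which attributes drive the intelligent combination, so this example substitutes a simple stand-in: each picture partition is attached to the nearest text partition by center distance, and pictures farther than a threshold are left for manual association. The names `Zone`, `associate` and `to_xml`, the XML element names and the threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Zone:
    zone_id: str
    kind: str                                # "text" or "picture"
    box: Tuple[float, float, float, float]   # x, y, width, height on the page
    content: str                             # text content, or picture name


def center(zone: Zone) -> Tuple[float, float]:
    x, y, w, h = zone.box
    return (x + w / 2, y + h / 2)


def distance(a: Zone, b: Zone) -> float:
    (ax, ay), (bx, by) = center(a), center(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def associate(zones: List[Zone], max_distance: float):
    """Attach each picture zone to its nearest text zone (assumed heuristic).

    Returns (articles, unresolved): articles maps a text zone id to its
    pictures; unresolved lists pictures left for manual association.
    """
    texts = [z for z in zones if z.kind == "text"]
    pictures = [z for z in zones if z.kind == "picture"]
    articles: Dict[str, List[Zone]] = {t.zone_id: [] for t in texts}
    unresolved: List[Zone] = []
    for pic in pictures:
        nearest = min(texts, key=lambda t: distance(t, pic), default=None)
        if nearest is not None and distance(nearest, pic) <= max_distance:
            articles[nearest.zone_id].append(pic)
        else:
            unresolved.append(pic)
    return articles, unresolved


def to_xml(zones: List[Zone], articles: Dict[str, List[Zone]]) -> str:
    """Export the associated manuscripts as a structured (XML) document."""
    by_id = {z.zone_id: z for z in zones}
    root = ET.Element("page")
    for text_id, pics in articles.items():
        article = ET.SubElement(root, "article")
        ET.SubElement(article, "text").text = by_id[text_id].content
        for pic in pics:
            ET.SubElement(article, "picture", name=pic.content)
    return ET.tostring(root, encoding="unicode")


zones = [
    Zone("t1", "text", (0, 0, 100, 50), "Headline story"),
    Zone("p1", "picture", (0, 60, 100, 40), "photo1.jpg"),
    Zone("p2", "picture", (900, 900, 50, 50), "ad.jpg"),
]
articles, unresolved = associate(zones, max_distance=100)
document = to_xml(zones, articles)
```

In this example `p1` lies close to `t1` and is attached to it, while `p2` exceeds the threshold and lands in `unresolved`, which corresponds to the special partitions the method hands over for manual association.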
In summary, during extraction the method of the invention can handle different types of layout files, such as PS, S2 and PDF, and multiple file types can be processed in a single window. Information such as the article author, article eyebrow heading, article theme and article subtitle can be extracted automatically. During processing, text partitions and picture partitions are intelligently combined, which ensures an accurate correspondence between text and pictures, improves the accuracy of layout information extraction and the speed of information processing, and greatly increases work efficiency.
The above is an example of the best mode for carrying out the invention; the parts not described in detail are common knowledge to those of ordinary skill in the art. The scope of protection of the invention is defined by the claims, and any equivalent transformation based on the technical teaching of the invention also falls within that scope.

Claims (1)

1. A method for extracting and processing layout information, characterized in that it comprises the following steps:
(1) Extraction of layout information: first obtain the required layout files from a data source; then analyze the layout files to determine their type; apply a different method for each file type to analyze and extract the layout information, which comprises text information and picture information; and convert the layout information into a unified format;
(2) Processing of layout information: first partition the text information and picture information, and intelligently combine the text partitions and picture partitions according to their attributes; manually associate any partitions that were not associated automatically; and process the content and form of the layout information;
(3) Output the processed layout information as a structured document.
CN2011101458507A 2011-06-01 2011-06-01 Method for extracting and processing layout information Pending CN102200966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101458507A CN102200966A (en) 2011-06-01 2011-06-01 Method for extracting and processing layout information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101458507A CN102200966A (en) 2011-06-01 2011-06-01 Method for extracting and processing layout information

Publications (1)

Publication Number Publication Date
CN102200966A true CN102200966A (en) 2011-09-28

Family

ID=44661652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101458507A Pending CN102200966A (en) 2011-06-01 2011-06-01 Method for extracting and processing layout information

Country Status (1)

Country Link
CN (1) CN102200966A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008180A (en) * 2014-06-09 2014-08-27 北京奇虎科技有限公司 Association method of structural data with picture, association device thereof
CN105095297A (en) * 2014-05-16 2015-11-25 北大方正集团有限公司 Method and apparatus for automatically generating picture code list
CN105530276A (en) * 2014-09-30 2016-04-27 北大方正集团有限公司 Method and system for processing reported data
CN109086327A (en) * 2018-07-03 2018-12-25 中国科学院信息工程研究所 A kind of method and device quickly generating webpage visual structure graph
CN114973286A (en) * 2022-06-16 2022-08-30 科大讯飞股份有限公司 Document element extraction method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002117035A (en) * 2000-10-10 2002-04-19 Citation Japan:Kk Device and method for analysis using free word and storage medium
CN101430714A (en) * 2008-12-08 2009-05-13 北大方正集团有限公司 Content structuring process method and system based on model
US20100329577A1 (en) * 2009-06-24 2010-12-30 Fuji Xerox Co., Ltd. Image processing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《软件世界》 20021031 孟奇刚 信息管理新天地 全文 1 , 第10期 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095297A (en) * 2014-05-16 2015-11-25 北大方正集团有限公司 Method and apparatus for automatically generating picture code list
CN104008180A (en) * 2014-06-09 2014-08-27 北京奇虎科技有限公司 Association method of structural data with picture, association device thereof
CN105530276A (en) * 2014-09-30 2016-04-27 北大方正集团有限公司 Method and system for processing reported data
CN109086327A (en) * 2018-07-03 2018-12-25 中国科学院信息工程研究所 A kind of method and device quickly generating webpage visual structure graph
CN109086327B (en) * 2018-07-03 2022-05-17 中国科学院信息工程研究所 Method and device for rapidly generating webpage visual structure graph
CN114973286A (en) * 2022-06-16 2022-08-30 科大讯飞股份有限公司 Document element extraction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111832403B (en) Document structure recognition method, document structure recognition model training method and device
CN113642584B (en) Character recognition method, device, equipment, storage medium and intelligent dictionary pen
CN104391881B (en) A log parsing method and system based on word segmentation algorithm
CN104598577B (en) A kind of extracting method of Web page text
CN109685052A (en) Method for processing text images, device, electronic equipment and computer-readable medium
US20150095769A1 (en) Layout Analysis Method And System
CN102200966A (en) Method for extracting and processing layout information
CN102253979A (en) Vision-based web page extracting method
CN101881999A (en) Oracle Video Input System and Implementation Method
CN103902918B (en) Method and device for rapidly extracting text from Word document
Talukder et al. Connected component based approach for text extraction from color image
CN109685061A (en) The recognition methods of mathematical formulae suitable for structuring
CN103440239A (en) Functional region recognition-based webpage segmentation method and device
CN105426355A (en) Syllabic size based method and apparatus for identifying Tibetan syntax chunk
Clausner et al. Efficient ocr training data generation with aletheia
Devi et al. Embedded optical character recognition on Tamil text image using Raspberry Pi
CN102236658B (en) Webpage content extracting method and device
CN104820962A (en) Method for generating and printing watermarks capable of replacing manual signatures
CN109447015A (en) A kind of method and device handling form Image center selection word
CN116958996A (en) OCR information extraction method, system and equipment
CN112541505B (en) Text recognition method, text recognition device and computer-readable storage medium
CN120068810A (en) Multi-mode document structuring processing method, device, equipment and medium
CN106156314B (en) A kind of data manipulation method and device, data search method and device
CN104598289A (en) Recognition method and electronic device
CN103678284A (en) Method and device for translating page characters

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110928