JP2019008615A

JP2019008615A - Document reconstruction device

Info

Publication number: JP2019008615A
Application number: JP2017124616A
Authority: JP
Inventors: 航一田代; Koichi Tashiro
Original assignee: Konica Minolta Inc
Current assignee: Konica Minolta Inc
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2019-01-17
Anticipated expiration: 2037-06-26
Also published as: JP7003457B2

Abstract

To provide a document reconstruction device capable of appropriately removing the column layout of a multi-column document and generating a document that is easy to browse even on a small terminal.

SOLUTION: A document reconstruction device includes a dividing unit 21 that divides a multi-column original document into a plurality of regions based on a predetermined region determination condition (for example, a boundary line or blank space), a text extraction unit 22 that extracts the text contained in each divided region, a determination unit 23 that determines whether the text extracted from each region is connected to text extracted from another region or is independent text, a text combining unit 24 that joins texts determined to be connected to text extracted from other regions into a single text, and a reconstruction unit 25 that arranges the independent texts and the texts joined by the text combining unit 24 in a single row, thereby reconstructing the original document into a document whose column layout has been removed.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a document reconstruction device and a program that reconstruct a document by removing the column layout of a multi-column document.

With the spread of small electronic terminals such as smartphones and tablet PCs, there are more and more opportunities to browse documents on such terminals. However, because smartphones and tablet PCs have small displays, a reader browsing a document such as a book or reference material must repeatedly zoom in and out and scroll up and down. Browsing a multi-column document such as a newspaper or magazine requires even more of these repeated operations.

For example, when a multi-column document such as the one shown in FIG. 24 is viewed on a small electronic terminal, the whole document is by default fitted to the screen, as shown in FIG. 25. Because the characters are then small and the text is laid out in columns, the viewer has to zoom in and out and scroll while reading. As the arrows in FIG. 26 indicate, the screen must be scrolled both horizontally and vertically, which makes operation cumbersome and inconvenient.

As a technique addressing this problem, Patent Document 1 below discloses a system that generates a second document by rearranging, into a single column, the text groups that make up a multi-column first document. In the first document, text groups each consisting of a plurality of vertically stacked lines are arranged side by side. The system assigns each text group a rank according to the order in which a human would read the first document, and generates the second document by arranging the text groups vertically in ascending order of rank.

Patent Document 1: JP 2017-49865 A

Patent Document 1 states that the text groups making up the first document are ranked according to the order in which a human reads the first document, but it does not disclose how that reading order is determined. In a newspaper, for example, the page is laid out in complex columns, so it is difficult to find the correct human reading order and arrange the text groups in a single row in the proper order. As a result, even when one text group and another text group in the first document originally formed a single text, they may not be placed consecutively, and the text cannot be read correctly.

An object of the present invention is to provide a document reconstruction device and a program that can properly remove the column layout of a multi-column document by joining the pieces, even when what was originally a single text has been split up and placed at scattered locations.

The gist of the present invention for achieving this object lies in the inventions set out in the following items.

[1] A document reconstruction device comprising:
a dividing unit that divides a multi-column original document into a plurality of regions based on a predetermined region determination condition;
a text extraction unit that extracts the text contained in each region after division;
a determination unit that determines whether the text extracted from each region after division is connected to text extracted from another region or is independent text;
a text combining unit that joins texts determined to be connected to text extracted from other regions into a single text; and
a reconstruction unit that arranges the independent texts and the texts joined by the text combining unit in a single row, thereby reconstructing the original document into a document whose column layout has been removed.

In the above invention, a multi-column document is divided into a plurality of regions based on region determination conditions such as column boundary lines and blank space; it is determined whether the text in each region is connected to the text in another region; connected texts are joined into a single text; and the texts are then arranged in a single row in normal reading order to reconstruct the document. Even with a complex column layout such as that of a newspaper, judging how the texts connect allows text that was split across scattered positions to be joined properly when the document is reconstructed.

[2] The document reconstruction device according to [1], wherein the determination unit makes the determination by quantifying how well texts connect to each other and comparing the resulting value with a predetermined threshold.

[3] The document reconstruction device according to [2], wherein the threshold can be set by a user.

[4] The document reconstruction device according to [1], wherein the determination unit makes the determination based on the similarity of the content of the texts and/or the continuity between the end of one text and the beginning of another text.

[5] The document reconstruction device according to [1], wherein the determination unit does not determine the connection between the text contained in one region and the text contained in a region located at a position where it cannot be continuous with the text contained in the one region.

In the above invention, narrowing down the candidates whose connection is determined reduces the processing load.

[6] The document reconstruction device according to [1], wherein the reconstruction unit arranges the texts in a single row according to the direction in which the text is read.

In the above invention, the texts are arranged along the direction in which reading advances line by line. Scrolling the reconstructed document in that reading direction therefore brings the next text into view naturally.

[7] The document reconstruction device according to [1], wherein the reconstruction unit arranges the texts in a single row according to a direction specified by a user.

In the above invention, the user can freely specify the direction in which the texts are arranged, so the document can be reconstructed with an arrangement that suits the user's preference.

[8] The document reconstruction device according to [1], wherein the reconstruction unit allows selection between reconstructing by adjusting the character size while preserving the text layout within each region and reconstructing by reflow.

In the above invention, the document can be reconstructed in a form (layout) that suits the user's preference.

[9] The document reconstruction device according to [8], wherein the character size can be specified by a user.

In the above invention, the document can be reconstructed with a character size that suits the user's preference.

[10] The document reconstruction device according to [1], wherein, when an object such as an image or a figure is present in a region in addition to text, the reconstruction unit treats the object and the text contained in that region as a unit when performing the arrangement.

In the above invention, the correspondence between the text and the object is maintained.

[11] The document reconstruction device according to [1], wherein, when the original document is image data, the dividing unit performs the division by determining the regions of the original document through image processing.

[12] The document reconstruction device according to [1], wherein, when the original document is written in a markup language, the dividing unit performs the division based on tag information indicating the column layout.

[13] The document reconstruction device according to [1], wherein, when the original document is image data, the text extraction unit extracts the text by character recognition.

[14] The document reconstruction device according to [1], wherein, when the original document is written in a markup language, the text extraction unit extracts the text based on tag information indicating text areas.

[15] A program that causes an information processing device to function as the document reconstruction device according to any one of [1] to [14].

According to the document reconstruction device and the program of the present invention, even when what was originally a single text has been split up and placed at scattered locations in a multi-column document, the pieces can be joined appropriately and the column layout of the document can be removed properly.

FIG. 1 is a diagram showing an example of a document browsing system including the document reconstruction device according to the present invention.
FIG. 2 is a block diagram showing the schematic configuration of the document reconstruction device.
FIG. 3 is a flowchart outlining the process by which the document reconstruction device removes the column layout of an original document and creates a reconstructed document.
FIG. 4 is a diagram showing the original document of Example 1 and the state in which it has been divided into a plurality of regions based on boundary lines and labeled.
FIG. 5 is a diagram showing the original document of Example 2 and the state in which it has been divided into a plurality of regions based on blank space and labeled.
FIG. 6 is a flowchart showing the document reconstruction process (details of step S107 in FIG. 3).
FIG. 7 is a diagram showing the reconstructed document obtained by reconstructing the original document of Example 1 shown in FIG. 4.
FIG. 8 is a diagram showing the original document of Example 3.
FIG. 9 is a diagram showing the original document of Example 3 divided into regions 1 to 4 and labeled.
FIG. 10 is a diagram showing the reconstructed document obtained by reconstructing the original document of Example 3 shown in FIG. 8.
FIG. 11 is a diagram showing the original document of Example 4.
FIG. 12 is a diagram showing the original document of Example 4 divided into a plurality of regions and labeled.
FIG. 13 is a diagram showing the reconstructed document obtained by reconstructing the original document of Example 4.
FIG. 14 is a diagram showing the original document of Example 5.
FIG. 15 is a diagram showing the reconstructed document obtained by reconstructing the original document of Example 5.
FIG. 16 is a diagram showing the original document of Example 6.
FIG. 17 is a diagram showing the reconstructed document obtained by reconstructing the original document of Example 6.
FIG. 18 is a diagram showing the reconstructed document of Example 7.
FIG. 19 is a diagram showing the reconstructed document of Example 8.
FIG. 20 is a diagram showing the original document of Example 9.
FIG. 21 is a diagram showing the reconstructed document obtained by reconstructing the original document of Example 9.
FIG. 22 is a diagram showing the original document of Example 10.
FIG. 23 is a diagram showing the reconstructed document of Example 10.
FIG. 24 is a diagram showing an example of a multi-column document.
FIG. 25 is a diagram showing the entire multi-column document of FIG. 24 displayed on the display unit of a small portable terminal.
FIG. 26 is a diagram showing the multi-column document of FIG. 24 being browsed on a small portable terminal.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows an example of a document browsing system 2 that includes the document reconstruction device according to the present invention. The document browsing system 2 comprises a small portable terminal 5, such as a smartphone or tablet used by a user, and a document reconstruction device 10, which is a server (information processing device) communicably connected to the portable terminal 5 over a network.

When a multi-column document is to be browsed, the portable terminal 5 sends the data of that document (referred to as the original document) to the document reconstruction device 10 over the network and requests removal of the column layout (P1). The document reconstruction device 10 removes the column layout of the received original document, generates a reconstructed document that can be browsed by scrolling in one direction only (P2), and sends the reconstructed document to the portable terminal 5 (P3). By browsing the reconstructed document, the user of the portable terminal 5 can read the document smoothly from beginning to end with nothing more than a scroll operation in one direction.

Alternatively, a program that performs the column removal function of the document reconstruction device 10 may be installed on the portable terminal 5 so that the column layout of a multi-column document is removed on the portable terminal 5 itself (P4 in FIG. 1).

FIG. 2 is a block diagram showing the schematic configuration of the document reconstruction device 10. The document reconstruction device 10 is configured by connecting, to a CPU (Central Processing Unit) 11, a RAM (Random Access Memory) 12, a storage unit 13 made up of a ROM (Read Only Memory), a hard disk device, and the like, a network communication unit 14, an input I/F unit 15, an output I/F unit 16, and so on.

The CPU 11 has a microprocessor and runs middleware and application programs on top of an OS program. The storage unit 13 stores various programs and data. The functions of the document reconstruction device 10 are realized by the CPU 11 executing processing in accordance with these programs. The RAM 12 is used as a work memory that temporarily stores various data while the CPU 11 executes processing.

The network communication unit 14 communicates with the portable terminal 5 and various external devices over the network. The network communication unit 14 receives the original document data and the column removal request from the portable terminal 5, and sends the reconstructed document to the portable terminal 5.

The input device 15 is a device, such as a keyboard or mouse, for entering user operations. The output device 16 is a display device such as a liquid crystal monitor. Various settings for column removal (for example, the layout selection and character size selection described later) can be accepted from the input device 15 of the document reconstruction device 10 and also from the portable terminal 5.

By executing a program, the CPU 11 functions as a dividing unit 21, a text extraction unit 22, a determination unit 23, a text combining unit 24, and a reconstruction unit 25.

The dividing unit 21 divides the original document into a plurality of regions based on a predetermined region determination condition.

The text extraction unit 22 extracts the text (text group) contained in each region after division.

The determination unit 23 determines whether the text extracted from each region is connected to text extracted from another region or is independent text.

The text combining unit 24 joins texts determined to be connected to text extracted from other regions into a single text.

The reconstruction unit 25 arranges the independent texts and the texts joined by the text combining unit 24 in a single row, thereby reconstructing the original document into a document whose column layout has been removed.

FIG. 3 is a flowchart outlining the process by which the document reconstruction device 10 removes the column layout of an original document and creates a reconstructed document. The document reconstruction device 10 first receives the original document as input (step S101). The input original document may be a document expressed in character codes or may be expressed as image data such as bitmap data. In this example, the original document is assumed to be image data obtained by reading a paper document with a scanner or the like.

Next, the CPU 11 (dividing unit 21) of the document reconstruction device 10 divides the original document into a plurality of regions based on a predetermined region determination condition and labels each region after division (step S102).

The region determination condition is, for example, the presence of a column boundary line or of a blank area of at least a certain size; it is a condition for identifying a range in which characters are clustered as a single region. When column boundary lines are drawn in the multi-column original document, as shown in FIG. 4, the document is divided into a plurality of regions using those lines as the reference. FIG. 4(b) shows the original document of Example 1 after division and labeling. The broken lines in the figure indicate the regions. The original document of FIG. 4 is referred to as Example 1.

When there are no boundary lines, the document is divided into a plurality of regions using, as the reference, the blank areas provided between texts and between the tiers of the column layout, as shown in FIG. 5. More specifically, the line spacing of the character strings is checked, and a spacing that differs from the preceding and following spacings by at least a certain amount is judged to be a blank. For example, if the spacing runs 12 pt, 12 pt, ... and then becomes 30 pt, the place where it became 30 pt is judged to contain a separating blank area. FIG. 5(b) shows the original document of Example 2 after division and labeling. The broken lines in the figure indicate the regions. The original document of FIG. 5 is referred to as Example 2.
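The blank-based division described above can be illustrated with a minimal Python sketch. It is not the method disclosed in the source: the function names `find_region_breaks` and `group_lines`, the `ratio` parameter, and the choice of comparing each spacing with the median spacing are all assumptions made for illustration, since the source only says that the spacing must differ "by at least a certain amount".

```python
from statistics import median
from typing import List


def find_region_breaks(gaps_pt: List[float], ratio: float = 2.0) -> List[int]:
    """Return the indices of line gaps that are treated as separating blanks.

    gaps_pt[i] is the spacing (in points) between line i and line i + 1.
    Assumption: a gap is a separator when it is at least `ratio` times
    the median gap of the page.
    """
    if not gaps_pt:
        return []
    typical = median(gaps_pt)
    if typical <= 0:
        return []
    return [i for i, gap in enumerate(gaps_pt) if gap >= ratio * typical]


def group_lines(gaps_pt: List[float], ratio: float = 2.0) -> List[List[int]]:
    """Group line indices 0..len(gaps_pt) into regions split at wide gaps."""
    breaks = set(find_region_breaks(gaps_pt, ratio))
    groups: List[List[int]] = []
    current = [0]
    for i in range(len(gaps_pt)):
        if i in breaks:
            # A wide gap closes the current region.
            groups.append(current)
            current = []
        current.append(i + 1)
    groups.append(current)
    return groups
```

With the spacings from the example (12 pt, 12 pt, then 30 pt), the 30 pt gap is the only one flagged, so the lines before it and the lines after it fall into separate regions.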

Next, the CPU 11 (text extraction unit 22) of the document reconstruction device 10 recognizes the characters contained in each divided region by optical character recognition or the like, converts them into character codes, and extracts the text (character group) contained in each region in character code form (step S103). When the original document is written in character codes, those character codes can simply be extracted as they are.

The CPU 11 (text extraction unit 22) of the document reconstruction device 10 analyzes the extracted text (character code group) and identifies its language, the line direction (vertical or horizontal writing), the direction in which reading advances line by line, and so on (step S104). For the original document of Example 1 shown in FIG. 4, for instance, it recognizes that the text is Japanese, written vertically, and read from right to left. From the language, the reading direction, and so on, it then determines the standard order in which the texts of the regions of this document are read (which region is read first, which next, and so on) and assigns the text of each region a rank following that standard reading order as an initial value.

In Example 1 shown in FIG. 4, it is determined that reading proceeds from right to left within each tier and from the top tier downward, so the standard reading order is region 1 → region 2 → region 3 → region 4. Accordingly, rank 1 is assigned as the initial value to the text of region 1, rank 2 to the text of region 2, rank 3 to the text of region 3, and rank 4 to the text of region 4. The text extracted from region N is called text N. The rank given to text N is written in parentheses after it, as in text N(1).
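The initial ranking can be sketched in Python as follows. This is an illustrative assumption rather than the disclosed implementation: the `Region` class, its `tier` and `x` fields, and the function `assign_initial_ranks` are hypothetical, and the sketch presumes that each region has already been assigned to a tier (horizontal band) of the page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    label: int   # region label given at division time
    tier: int    # hypothetical field: index of the tier, 0 = top
    x: float     # hypothetical field: horizontal position of the region


def assign_initial_ranks(regions: List[Region],
                         right_to_left: bool = True) -> Dict[int, int]:
    """Rank regions tier by tier from the top, and within a tier in
    reading direction (right to left for vertical Japanese text)."""
    ordered = sorted(
        regions,
        key=lambda r: (r.tier, -r.x if right_to_left else r.x),
    )
    # Ranks start at 1, matching "rank 1" for the first region read.
    return {r.label: rank for rank, r in enumerate(ordered, start=1)}
```

For a two-tier page with two regions per tier, read right to left, the regions receive ranks 1 to 4 in the order upper right, upper left, lower right, lower left.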

Next, the CPU 11 (determination unit 23) of the document reconstruction device 10 determines whether the text extracted from each region is connected to another text or is independent text (step S105). The texts extracted from regions 1 to 4 are compared with one another to determine whether the texts of two regions form a continuous text. A predetermined index value is calculated for this determination. Comparison methods based on index values include calculating the text similarity and calculating the contextual match.

The text similarity between texts is calculated using, for example, TF (Term Frequency)-IDF (Inverse Document Frequency) or cosine similarity. Here, cosine similarity is used. The following is an example of calculating the similarity between text 1, extracted from region 1, and text 2, extracted from region 2. First, expressing the frequency of the words contained in each text as a vector gives
Text 1: (this year, economy, ...) = (3, 10, ...),
Text 2: (recent years, technology trends, ...) = (15, 3, ...),
and the cosine similarity is obtained as
cos θ = (vector of text 1 · vector of text 2) / (|text 1| |text 2|).
Here, assume that cos θ = 0.2 (with an upper limit of 1.0).
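The cosine similarity of two word-frequency vectors can be computed as in the following Python sketch. The function name `cosine_similarity` and the assumption that each text has already been split into a list of words (for Japanese, by some tokenizer not shown here) are illustrative and not taken from the source.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(words_a: List[str], words_b: List[str]) -> float:
    """Cosine similarity of the term-frequency vectors of two word lists."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Because word counts are never negative, the result lies between 0.0 (no shared words) and 1.0 (identical word distributions), which matches the upper limit of 1.0 mentioned in the text.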

As for the contextual match, suppose that text 1 ends with "... was." and text 2 begins with "In recent technology trends ...". Because text 1 ends with a sentence-final punctuation mark (。), it is judged unlikely that another text continues from it, and the contextual match is calculated as, for example, 0.3 (with an upper limit of 1.0).
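A very rough Python sketch of such a boundary check is shown below. It is a simplification under stated assumptions: it looks only at the last character of the preceding text, the function name `boundary_continuity` and the set `SENTENCE_END` are hypothetical, the value 0.3 is the example value from the text, and the value 0.9 for an unfinished sentence is an arbitrary placeholder.

```python
# Characters treated as ending a sentence (assumed set, Japanese and Latin).
SENTENCE_END = "。．.！？!?"


def boundary_continuity(prev_text: str, low: float = 0.3,
                        high: float = 0.9) -> float:
    """Score how likely it is that another text continues from prev_text.

    Returns `low` when prev_text ends with sentence-final punctuation
    (or is empty) and `high` when it appears to stop mid-sentence.
    """
    stripped = prev_text.rstrip()
    if not stripped:
        return low
    return low if stripped[-1] in SENTENCE_END else high
```

A fuller version would also examine the beginning of the following text, as the continuity criterion in the text considers both the end of one text and the start of the other.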

Judging the above comprehensively, assume that the final text continuity of text 1 and text 2 is calculated to be 0.25.

The same calculation is performed between neighboring texts. That is, the text continuity is calculated for each of the pairs text 1 and text 2, text 1 and text 3, text 1 and text 4, text 2 and text 3, text 2 and text 4, and text 3 and text 4, and each resulting value is checked to see whether it exceeds a threshold. For example, if the text continuity between text 1 and text 2 is 0.25 and the threshold, either preset or set by the user, is 0.8 (with an upper limit of 1.0), text 1 and text 2 are judged not to be continuous. In Example 1 shown in FIG. 4, no text is continuous with any other text, and each is judged to be independent text.
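The pairwise check against the threshold can be sketched in Python as follows. The names `combine_scores` and `connected_pairs` are hypothetical. In particular, combining the two index values by a plain average is an assumption: the source only says they are judged comprehensively, although an average of the example values 0.2 and 0.3 does give the stated 0.25.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def combine_scores(similarity: float, context_match: float) -> float:
    """Assumed combination rule: the plain average of the two index values."""
    return (similarity + context_match) / 2


def connected_pairs(labels: Sequence[int],
                    score: Callable[[int, int], float],
                    threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return every pair of text labels whose continuity exceeds the threshold.

    `score(a, b)` is a caller-supplied function returning the text
    continuity of texts a and b in the range 0.0 to 1.0.
    """
    return [(a, b) for a, b in combinations(labels, 2)
            if score(a, b) > threshold]
```

For four texts, `combinations` yields exactly the six pairs listed in the text. Restricting `labels` to regions that could plausibly be adjacent would reduce the number of comparisons.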

Next, when step S105 has found texts that are judged to be continuous, the CPU 11 (text combining unit 24) of the document reconstruction device 10 joins those texts into a single text (step S106). For example, if step S105 determined that text 2 and text 4 are continuous, text 2 and text 4 are joined into a single text (written as text 2+4). The reading rank of the joined text 2+4 is the smaller of the rank of text 2 and the rank of text 4.
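The joining step and its rank rule can be sketched in Python as follows. The function `merge_texts` and the use of dictionaries keyed by string labels are illustrative assumptions; only the rule that the joined text takes the smaller of the two ranks comes from the text.

```python
from typing import Dict, Tuple


def merge_texts(texts: Dict[str, str], ranks: Dict[str, int],
                first: str, second: str
                ) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Join two continuous texts and give the result the smaller rank.

    `first` is the text that is read first; `second` continues it.
    Returns new dictionaries and leaves the inputs unchanged.
    """
    merged_label = f"{first}+{second}"
    new_texts = {k: v for k, v in texts.items() if k not in (first, second)}
    new_ranks = {k: v for k, v in ranks.items() if k not in (first, second)}
    new_texts[merged_label] = texts[first] + texts[second]
    new_ranks[merged_label] = min(ranks[first], ranks[second])
    return new_texts, new_ranks
```

Joining texts "2" and "4" this way produces a text labeled "2+4" with rank 2, so it is read right after text 1 and before text 3.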

After the continuous texts have been joined in step S106 based on the determination result of step S105, the CPU 11 (reconstruction unit 25) of the document reconstruction device 10 arranges the texts (the texts that were independent from the start and the texts joined into one) in a single row, generates a document whose column layout has been removed (the reconstructed document) (step S107), and ends the process.

FIG. 6 is a flowchart showing the details of step S107 in FIG. 3. First, the layout of the document is determined (step S201). Here, either adjusting the character size while keeping the layout of each sentence, or reflow, can be selected as the layout. This selection is received from, for example, the user of the portable terminal 5. For the original document of Example 1 in FIG. 4, the document is reconstructed by adjusting the character size while keeping the sentence layout.

Next, the CPU 11 (reconstruction unit 25) of the document reconstruction device 10 determines the direction in which the sentences are arranged in a row, based on the reading direction specified in step S104 (step S202). In the original document of Example 1 shown in FIG. 4, each line is read vertically and the lines progress from right to left, so the sentences extracted from the regions are arranged in a row from right to left. When viewing this document, the user adjusts the viewing position by horizontal scrolling.

Finally, the CPU 11 (reconstruction unit 25) of the document reconstruction device 10 reconstructs the document by arranging the independent sentences extracted from the regions and the sentences combined in step S106 in ascending order of rank, according to the layout determined in step S201 and the arrangement direction determined in step S202 (step S203).

FIG. 7 shows the reconstructed document obtained from the original document of Example 1 shown in FIG. 4. FIG. 7(b) shows how the reconstructed document is scrolled when viewed on the portable terminal 5. The reconstructed document can be read with rightward scrolling alone. Compared with the original document of FIG. 4, this reduces the number of zoom operations and vertical scrolls, yielding a document that is easy to read on the terminal the user is using.

Next, various examples of generating a reconstructed document from an original document are described.

<Example 3>
FIG. 8 shows the original document of Example 3. This original document is laid out in two tiers, upper and lower, and is a Japanese document written vertically and read from right to left. It is divided into four regions: the upper-right region contains one independent sentence, the lower-right region contains one independent sentence, and the sentence in the upper-left region continues into the sentence in the lower-left region.

The CPU 11 (dividing unit 21) of the document reconstruction device 10 receives the original document (step S101), divides it into four regions using the boundary lines as a reference, and labels each region (step S102). FIG. 9 shows the regions labeled as region 1 to region 4. The broken lines in the figure indicate the regions.

The CPU 11 (sentence extraction unit 22) of the document reconstruction device 10 performs optical character recognition on each of the labeled regions 1 to 4 and extracts text (sentences) (step S103). In this example, the sentence "Regarding this year's economy ○...○" is extracted from region 1, "Regarding recent technology trends ×...×" from region 2, "Regarding yesterday's sports △...△" from region 3, and "×...×" from region 4.

Then the language, the line direction, and the direction of progression from line to line are specified (step S104). The original document of Example 3 is identified as a Japanese document written vertically and read from right to left. A rank is then assigned to the sentence of each region according to the standard reading order under these conditions. For the original document of Example 3, the standard reading order is region 1 → region 2 → region 3 → region 4, so the initial rank of sentence 1 is 1, that of sentence 2 is 2, that of sentence 3 is 3, and that of sentence 4 is 4.

Next, the CPU 11 (discrimination unit 23) of the document reconstruction device 10 determines whether the sentence extracted from each region is connected to another sentence or is independent (step S105). Suppose that, comparing by the same method as for the original document of Example 1, the similarity between sentence 2 and sentence 4 is 0.8 (upper limit 1.0). As for the degree of context match, suppose sentence 2 ends with "...であり、将" and sentence 4 begins with "来性は高いといえる。...". Combining "将" and "来" yields the string "将来" ("future"), which is recognized as a single word when matched against a word database (dictionary), so the degree of context match is calculated as 1.0 (upper limit 1.0). Judging these together, suppose the final sentence continuity is calculated as 0.9.
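The boundary check and the combined score can be illustrated with the sketch below. The function names, the 0.0 score for a non-match, and the choice of a simple average are all assumptions; the specification only says the two values are judged comprehensively, and an average happens to reproduce the 0.9 in this example.

```python
def context_match(end_of_first, start_of_second, dictionary, max_word_len=4):
    """Return 1.0 if a word in the dictionary straddles the boundary, else 0.0.

    Tries every split of a candidate word (up to max_word_len characters)
    into a tail of the first sentence and a head of the second sentence.
    """
    for tail_len in range(1, max_word_len):
        for head_len in range(1, max_word_len - tail_len + 1):
            candidate = end_of_first[-tail_len:] + start_of_second[:head_len]
            if candidate in dictionary:
                return 1.0
    return 0.0

def sentence_continuity(similarity, context):
    """Assumed combination rule: the mean of similarity and context match."""
    return (similarity + context) / 2

dictionary = {"将来"}  # stand-in for the word database
context = context_match("であり、将", "来性は高いといえる。", dictionary)
score = sentence_continuity(0.8, context)
print(context, round(score, 2))
```

A real discrimination unit would likely use a morphological analyzer rather than a fixed-length substring search; the brute-force lookup here is only meant to show the idea of matching a split word against a dictionary.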

Neighboring sentences are compared in the same way, and the sentence continuity is calculated for every combination (sentence 1 and sentence 2, sentence 1 and sentence 3, sentence 1 and sentence 4, sentence 2 and sentence 3, sentence 2 and sentence 4, and sentence 3 and sentence 4).

Here, suppose that comparing the sentence continuity values with the threshold (0.8) shows that the values for sentence 1 and sentence 2, sentence 1 and sentence 3, sentence 1 and sentence 4, sentence 2 and sentence 3, and sentence 3 and sentence 4 do not exceed the user-set threshold (0.8), so these pairs are not continuous, whereas the value calculated for sentence 2 and sentence 4 exceeds the threshold, so that pair is judged to be continuous. Accordingly, sentence 1 and sentence 3 are independent sentences, and sentence 2 and sentence 4 are combined into one sentence (step S106). Since the initial rank of sentence 2 is (2) and that of sentence 4 is (4), the rank of sentence 2+4 is (2).

In reconstructing the document, the method of adjusting the character size while keeping the layout is selected (step S201), and, based on the reading direction specified in step S104, the arrangement direction is determined to be "arrange the sentences in a row from right to left" (step S202). Then, according to this layout and arrangement direction, sentence 1 (1), sentence 2+4 (2), and sentence 3 (3) are arranged in ascending order of the rank shown in parentheses to generate the reconstructed document (step S203).
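The ordering in steps S202 and S203 can be sketched as below for the three sentences of Example 3. The function name `arrange` and the "rtl"/"ltr" direction codes are assumptions made for illustration; the function returns the labels in on-screen left-to-right order.

```python
def arrange(sentences, direction):
    """Order sentences by rank and lay them out in a single row.

    sentences: list of (label, rank) tuples.
    direction: "rtl" places rank 1 at the right end, "ltr" at the left end.
    Returns the labels in on-screen left-to-right order.
    """
    by_rank = [label for label, _ in sorted(sentences, key=lambda s: s[1])]
    return by_rank[::-1] if direction == "rtl" else by_rank

example3 = [("3", 3), ("1", 1), ("2+4", 2)]
row = arrange(example3, "rtl")
print(row)  # sentence 1 sits at the right end, where reading starts
```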

FIG. 10 shows the reconstructed document obtained from the original document of Example 3 shown in FIG. 8. FIG. 10(b) shows how the reconstructed document is scrolled when viewed on the portable terminal 5. The reconstructed document can be read with rightward scrolling alone. Thus, even when a sentence is split across separate regions by the column layout, the document can be read with fewer zoom operations and vertical scrolls than the original, yielding a document that is easy to read on the user's terminal.

<Example 4>
The original document of Example 4 consists of multiple pages (two pages) (see FIG. 11). The first page consists of two sentences in the upper tier and two in the lower tier, and the second page consists of two sentences in the upper tier and one in the lower tier. All are Japanese, written vertically, and read from right to left. The lower-left sentence on the first page continues into the upper-right sentence on the second page, and the upper-left sentence on the second page continues into the lower sentence. The two-page document therefore contains five independent sentences.

FIG. 12 shows the result of labeling through steps S101 and S102 of FIG. 3. Steps S103 and S104 are performed in the same way as in Examples 1 and 3.

In the connection determination of step S105, sentences 1 to 7 are compared, and the sentence continuity, an index for determining whether sentences are continuous, is calculated. Here, for example, sentences 5, 6, and 7 are neither in regions adjacent to sentence 1 nor on the same page, so they are considered unlikely to be continuous with sentence 1. The calculation of similarity and context match for those pairs is therefore omitted.

That is, computing every combination in this example would require the sentence continuity for 21 pairs, but the calculation is omitted for pairs that cannot be continuous. In this case it suffices to consider sentences 2 and 3 for sentence 1; sentences 3, 4, and 5 for sentence 2; sentence 4 for sentence 3; sentence 5 for sentence 4; sentences 6 and 7 for sentence 5; and sentence 7 for sentence 6, for a total of 10 pairs.
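The pair counts above can be checked with a short sketch. The `candidates` table simply transcribes the partner lists given in the text; how the device derives them from region adjacency is not modeled here.

```python
from itertools import combinations

sentence_ids = range(1, 8)  # sentences 1 to 7

# Every unordered pair of the seven sentences.
all_pairs = list(combinations(sentence_ids, 2))

# Partner lists transcribed from the description of Example 4.
candidates = {1: [2, 3], 2: [3, 4, 5], 3: [4], 4: [5], 5: [6, 7], 6: [7]}
pruned_pairs = [(a, b) for a, partners in candidates.items() for b in partners]

print(len(all_pairs), len(pruned_pairs))
```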

In the example of FIG. 12, the connection determination finds that sentence 1, sentence 2, and sentence 3 are each independent, that sentence 4 and sentence 5 are continuous, and that sentence 6 and sentence 7 are continuous. FIG. 13 shows the result of reconstructing the document by adjusting the character size while keeping the sentence layout and arranging the sentences in a row from right to left so that they can be read by horizontal scrolling. Thus, even when a sentence spans pages, the number of zoom operations and vertical scrolls is reduced compared with the original document, yielding a document that is easy to read on the terminal the user is using. Although Example 4 illustrates a sentence spanning two pages, the document may have more pages.

<Example 5>
In the original document of Example 5, objects such as figures and images exist within the regions. FIG. 14 shows the original document of Example 5. It consists of two sentences in the upper tier and two in the lower tier (four in total), and the upper-right sentence contains an image object. This is a Japanese document in which each line is read vertically and the lines progress from right to left.

As in the other examples, the CPU 11 of the document reconstruction device 10 performs the process of FIG. 3, divides the document into regions, and labels each region. If a region contains both a sentence and an object such as an image, the object is treated as belonging to (linked to) that sentence. FIG. 14(b) shows the original document of Example 5 after division and labeling. The broken lines in the figure indicate the regions.

When the CPU 11 of the document reconstruction device 10 executes the steps of FIG. 3, sentence 1, sentence 2, sentence 3, and sentence 4 are extracted, object A is linked to sentence 1, and each sentence is judged to be independent. The reconstructed document is then created by adjusting the character size while keeping the sentence layout and arranging the sentences in a row from right to left according to their ranks so that they can be read by horizontal scrolling.

FIG. 15 shows the generated reconstructed document. Object A is placed within sentence 1 in the same way as in the layout of the original document. FIG. 15(b) shows how the reconstructed document is scrolled when viewed on the portable terminal 5.

Thus, even when a sentence contains an object such as an image, the number of zoom operations and vertical scrolls can be reduced, yielding a document that is easy to read on the terminal the user is using. Although a single image is illustrated as the object, there may be several. The object is also not limited to an image and may be, for example, a graph or a table.

<Example 6>
The original document of Example 6 (see FIG. 16) is a Japanese document written horizontally; each line is read from left to right, and the lines progress from top to bottom. In this example there are two sentences in the upper tier and two in the lower tier. FIG. 16(b) shows the original document of Example 6 after division and labeling. The broken lines in the figure indicate the regions.

When the CPU 11 of the document reconstruction device 10 executes the steps of FIG. 3, each sentence is judged to be an independent Japanese sentence, the character size is adjusted while keeping the sentence layout, and, because each line is read from left to right, the document is reconstructed so that it can be read by vertical scrolling. FIG. 17 shows the resulting reconstructed document. Compared with the original document, this reduces the number of zoom operations and vertical scrolls, yielding a document that is easy to read on the terminal the user is using.

<Example 7>
In Example 7, when the sentences are arranged and the document is reconstructed, the result is a document that supports reflow display. FIG. 18 shows an example of this document displayed with reflow.

In some cases reflow display reduces the number of zoom operations and vertical scrolls more than adjusting the character size while keeping the document layout does, which yields a document that is even easier to read on the terminal the user is using. The document reconstruction device 10 receives the selection of reflow from the portable terminal 5, for example together with the instruction to release the column layout.

<Example 8>
In Example 8, reflow display is performed with a character size and font specified by the user. FIG. 19 shows an example of reflow display with a user-specified character size and font. The original document is the same as in Example 7. The document reconstruction device 10 receives the character size to use for reflow from the portable terminal 5, for example together with the instruction to release the column layout. Alternatively, the portable terminal 5 may accept a character-size instruction at viewing time and change the character size it displays.

Even with reflow display adapted to the device, the user may find the characters small. Setting and adjusting the character size in advance reduces the number of zoom operations, and switching to a font the user prefers makes the document easier for the user to read.

<Example 9>
FIG. 20 shows the original document of Example 9. It has two sentences in the upper tier and two in the lower tier, and is read from the upper left toward the lower right. Here the text is Mongolian. FIG. 20(b) shows the original document of Example 9 after division into regions and labeling. The broken lines in the figure indicate the regions. The labeling order is the same as for Japanese.

In this example, suppose that the steps of FIG. 3 determine that the original document of Example 9 is Mongolian text and that each sentence is independent. The character size is adjusted while keeping the sentence layout, and, because the lines progress from left to right, a reconstructed document is generated in which the sentences are arranged in a row from left to right so as to be read by horizontal scrolling. Because the language is Mongolian, the reading ranks are sentence 2 (1), sentence 1 (2), sentence 4 (3), and sentence 3 (4).
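The rank assignment above can be reproduced with the sketch below. The grid positions are an assumption inferred from the Japanese-style labeling used in Example 3 (label 1 upper right, 2 upper left, 3 lower right, 4 lower left), and the function name and "rtl"/"ltr" codes are hypothetical.

```python
# (row, column) of each labeled region; column 0 is the left side.
# Assumed from the Japanese-style labeling described for Example 3.
positions = {1: (0, 1), 2: (0, 0), 3: (1, 1), 4: (1, 0)}

def reading_order(positions, line_progression):
    """Return region labels in reading order, tier by tier.

    line_progression: "rtl" for vertical Japanese (lines advance right to left),
                      "ltr" for vertical Mongolian (lines advance left to right).
    """
    sign = -1 if line_progression == "rtl" else 1
    return sorted(positions, key=lambda k: (positions[k][0], sign * positions[k][1]))

print(reading_order(positions, "rtl"))  # Japanese vertical text
print(reading_order(positions, "ltr"))  # Mongolian vertical text
```

Under these assumptions the Mongolian case yields the order 2, 1, 4, 3 given in the text, while the Japanese case keeps the labels in their original order.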

FIG. 21 shows the reconstructed document obtained from the original document of Example 9. FIG. 21(b) shows how the reconstructed document is scrolled when viewed on the portable terminal 5. The beginning of this document is at the left end, and the scroll direction when viewing is from left to right.

Thus, some languages are written vertically like Japanese but differ in the direction in which the lines progress, so arranging the sentences in the direction suited to the language produces a reconstructed document that is easier for the user to read.

<Example 10>
FIG. 22 shows the original document of Example 10. It is an English document whose sentences are placed in four regions in total, two in the upper tier and two in the lower tier. Each line is read from left to right, and the lines progress from top to bottom. FIG. 22(b) shows the original document of Example 10 after division into regions and labeling. The broken lines in the figure indicate the regions.

FIG. 23 shows the reconstructed document produced by the process of FIG. 3: the character size is adjusted while keeping the sentence layout, the English is translated into Japanese, and, because each line is read from left to right, the document is reconstructed so as to be read by vertical scrolling.

<Example 11>
Example 11 is the case where the original document is XML data with the same layout as Example 1 in FIG. 4. Processing is performed by using an XML parser to obtain the layout tags and the text tags.

For example, the tags that draw the column rules are extracted to divide the document into regions, and the sentence contained in each region is obtained by extracting the text tags.
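As a rough illustration of tag-based extraction, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree`. The specification does not name any schema, so the `<region>` and `<text>` tags, the `id` attribute, and the sample content are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <region> stands for one column area and
# each <text> element holds a run of text inside it.
SAMPLE = """
<document>
  <region id="1"><text>A</text><text>B</text></region>
  <region id="2"><text>C</text></region>
</document>
"""

def extract_sentences(xml_string):
    """Map each region id to the concatenated text of its <text> elements."""
    root = ET.fromstring(xml_string)
    return {
        region.get("id"): "".join(t.text or "" for t in region.iter("text"))
        for region in root.iter("region")
    }

sentences = extract_sentences(SAMPLE)
print(sentences)
```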

<Others>
The threshold that is compared with the sentence continuity to determine whether sentences are independent or continuous may be set freely by the user or may be a value predetermined by the device. In the examples above the device automatically determines the scroll direction for reading the document from the language and the direction in which the lines progress, but the user may instead be allowed to specify the scroll direction, with the reconstructed document generated by arranging the sentences along the specified direction.

When the document is reconstructed, the device may accept from the user a choice between adjusting the character size while keeping the layout and supporting reflow display, and may reconstruct the document with the layout method the user selects.

Thus, according to the present invention, even when what is originally a single sentence is split and placed in scattered locations in a multi-column document, the pieces are joined, the column layout is released appropriately, and a document that is easy to read even on a small terminal can be generated.

An embodiment of the present invention has been described above with reference to the drawings, but the specific configuration is not limited to that shown in the embodiment; changes and additions that do not depart from the gist of the present invention are also included in the present invention.

The region determination conditions used to divide a document into regions are not limited to boundary lines and blank space. For example, differences in character size (headlines use larger characters) or in background color may be used. Likewise, the method of determining whether the sentence extracted from each region is connected to a sentence extracted from another region or is independent is not limited to the one illustrated in the embodiment. For example, semantic analysis may be used in combination.

The embodiment illustrates original documents whose columns are stacked vertically, but the present invention also applies to documents whose columns are arranged side by side.

DESCRIPTION OF SYMBOLS
2 ... Document browsing system
10 ... Document reconstruction device
11 ... CPU
12 ... RAM
13 ... Storage unit
14 ... Network communication unit
15 ... Input device
16 ... Output device
21 ... Dividing unit
22 ... Sentence extraction unit
23 ... Discrimination unit
24 ... Sentence combining unit
25 ... Reconstruction unit

Claims

A document reconstruction device comprising:
a dividing unit that divides a multi-column original document into a plurality of regions based on a predetermined region determination condition;
a sentence extraction unit that extracts the sentence included in each divided region;
a discrimination unit that determines whether the sentence extracted from each divided region is connected to a sentence extracted from another region or is an independent sentence;
a sentence combining unit that joins sentences determined to be connected to sentences extracted from other regions into one sentence; and
a reconstruction unit that arranges the independent sentences and the sentences joined into one by the sentence combining unit in a row, and reconstructs the original document into a document whose column layout has been released.

The document reconstruction device according to claim 1, wherein the discrimination unit quantifies the plausibility of a connection between sentences and performs the discrimination by comparing the resulting value with a predetermined threshold value.

The document reconstruction apparatus according to claim 2, wherein the threshold value can be set by a user.

The document reconstruction device, wherein the discrimination unit performs the discrimination based on the similarity of the content of the sentences, or on the continuity between the end of one sentence and the beginning of another sentence.

The document reconstruction device according to claim 1, wherein the discrimination unit does not determine the connection between a sentence included in one region and a sentence included in a region located where continuity with that sentence is not possible.

The document reconstruction apparatus according to claim 1, wherein the reconstruction unit arranges the sentences in a line according to a direction in which the sentence is read.

The document reconstruction device according to claim 1, wherein the reconstruction unit arranges the sentences in a line according to a direction designated by a user.

The document reconstruction device according to claim 1, wherein the reconstruction unit allows selection between performing the reconstruction by adjusting the character size while keeping the layout of the sentence in each region and performing the reconstruction by reflow.

The document reconstruction apparatus according to claim 8, wherein a user can specify a character size.

The document reconstruction device according to claim 1, wherein, when an object such as an image or a figure exists in one region in addition to a sentence, the reconstruction unit treats the object and the sentence included in that region as a unit when arranging them.

The document reconstruction apparatus according to claim 1, wherein, when the original document is image data, the dividing unit divides the original document by determining an area by image processing.

The document reconstruction device according to claim 1, wherein when the original document is a document described in a markup language, the dividing unit divides the document based on tag information indicating a column.

The document reconstruction apparatus according to claim 1, wherein when the original document is image data, the sentence extraction unit extracts a sentence by character recognition.

The document reconstruction device according to claim 1, wherein when the original document is a document described in a markup language, the sentence extraction unit extracts a sentence based on tag information indicating a text area.

A program causing an information processing apparatus to function as the document reconstruction apparatus according to any one of claims 1 to 14.