JP2010218020A - Document management device, document management method, and program

Info

Publication number: JP2010218020A
Application number: JP2009061459A
Authority: JP
Inventors: Isamu Kozuka; 勇小塚; Yasukazu Mizushima; 靖和水嶋
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2009-03-13
Filing date: 2009-03-13
Publication date: 2010-09-30

Abstract

[Problem] To provide a document management apparatus, a document management method, and a program that, when registering in a database a document whose body contains attribute information characterizing that document, can register the document and the attribute information in association with each other without the attribute information being entered separately.
[Solution] A document input unit 102 of the document management apparatus receives a patent gazette file 100. An input document type determination unit 104 determines the file type of the patent gazette file 100. An attribute extraction unit 106 extracts the publication number as attribute information of the patent gazette file 100, based on attribute information specifying keywords marked with lenticular brackets (【 】) in the patent gazette file 100. A document information registration unit 108 registers the publication number and the patent gazette file 100 in a database 112 in association with each other.
[Selected drawing] FIG. 1

Description

The present invention relates to a document management apparatus, a document management method, and a program that use a computer to manage a document and attribute information characterizing the document as mutually related data. In particular, it relates to a document management apparatus, a document management method, and a program that, when the input is a document whose body contains the attribute information characterizing it, can register the input document and the attribute information without the attribute information being entered separately.

In general, when data is managed using a database, each piece of data is managed under a database schema composed of data tables, each of which defines a set of data item names naming the data. To register data in the database, the input data is shaped into a form that conforms to the data tables, a set of data corresponding to each data table is created, and that set is registered in the database. Consequently, at data entry time, each data item name must be entered together with the data corresponding to it.

FIG. 8 shows a conventional data management apparatus. The data management apparatus consists of data registration means, a database, and a database schema. Input data is supplied to the data registration means.
FIG. 9 shows an example of the input data shown in FIG. 8. Each line of the input data in FIG. 9 contains a data item name and the data corresponding to that name. Taking the registration of patent gazette information as an example, the input data describes the application number, the publication number, the registered patent number, and a gazette text location indicating where in storage the patent gazette text (the full text of the gazette) is saved.

FIG. 10 shows, as an example of the data schema shown in FIG. 8, a patent gazette schema used when patent gazette information is managed in a database. The patent gazette schema consists of two data tables. One is a number table, which defines the application number, the publication number, and the registered patent number. The other is a gazette text table, which defines the publication number and the gazette text storage location. The data tables are defined using a data definition language.

The data registration means shapes the content described in the input data into a form that conforms to the data tables and registers it in the database. The data management apparatus of FIG. 8 processes the input data of FIG. 9 one line at a time, extracts each data item name and the data corresponding to it, creates the data sets that conform to the tables of FIG. 10, namely (application number, publication number, registered patent number) and (publication number, gazette text location), and registers those data sets in the database.

In the data schema of FIG. 10, the publication number is defined as a common item in both tables. Because the number table and the gazette text table share the publication number, querying the database for patent information by publication number retrieves the application number, the registered patent number, and the gazette text location together with the publication number in a single operation.
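As a rough illustration of the two-table arrangement described for FIG. 10, the following Python sketch builds a number table and a gazette text table in an in-memory SQLite database and looks up a record by publication number. The table names, column names, function names, and sample values are hypothetical and chosen only for this illustration; the text does not specify them.

```python
import sqlite3


def build_patent_db():
    """Create the two tables of the conventional schema (names are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE number_table (
            application_number       TEXT,
            publication_number       TEXT PRIMARY KEY,
            registered_patent_number TEXT
        );
        CREATE TABLE gazette_text_table (
            publication_number    TEXT REFERENCES number_table(publication_number),
            gazette_text_location TEXT
        );
        """
    )
    return conn


def lookup(conn, publication_number):
    """Fetch all items tied to one publication number via the shared column."""
    return conn.execute(
        """
        SELECT n.publication_number, n.application_number,
               n.registered_patent_number, g.gazette_text_location
        FROM number_table AS n
        JOIN gazette_text_table AS g
          ON n.publication_number = g.publication_number
        WHERE n.publication_number = ?
        """,
        (publication_number,),
    ).fetchone()


if __name__ == "__main__":
    conn = build_patent_db()
    # Placeholder values, not real patent numbers.
    conn.execute("INSERT INTO number_table VALUES (?, ?, ?)",
                 ("APP-0001", "PUB-0001", "REG-0001"))
    conn.execute("INSERT INTO gazette_text_table VALUES (?, ?)",
                 ("PUB-0001", "/gazettes/PUB-0001.txt"))
    print(lookup(conn, "PUB-0001"))
```

Because both tables carry the publication number, a single join returns every related item, which is also why a mistyped publication number in either table silently breaks the association.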

However, such a data management apparatus suffers from erroneous registration caused by input mistakes. First, if the data item name that should read “publication number” (公開番号) is instead written as “published-translation number” (公表番号), the data either cannot be registered in the database or is registered incorrectly (the data item name mismatch problem). Second, if the data of a published-translation number is mistakenly written where the data for the publication number belongs, incorrect information is registered in the database (the data content mismatch problem). In addition, registering multiple records requires creating the data item names of FIG. 9 and the corresponding data content every time, which makes preparing the registration data laborious.

Patent Document 1 describes a method of registering identical data in bulk, intended to address the laborious data registration and the erroneous registration that arise when data with the same content is entered. In Patent Document 1, when data with the same content is to be registered, the data is not entered item by item; instead, data matching a search condition is retrieved from the data already registered in the database and is duplicated. This avoids the labor of manually entering the same data and the associated input errors.

Patent Document 1: JP 2007-86862 A

Erroneous registration of input data and the laboriousness of input work are important problems at data registration time in a data registration apparatus, and Patent Document 1 addresses the latter by duplicating data with the same content.
However, it does not solve the problem of input errors that occur during data registration in an ordinary data registration apparatus. Moreover, Patent Document 1 still requires data item names to be entered along with the data, so the labor of entering the data item names remains. Because a data item name is the key information that ties all related information together, an error in a data item name significantly impairs the reliability of the data in the database.

Furthermore, when a patent gazette text and bibliographic information such as the application number and publication number are supplied as input, as in FIG. 9, the bibliographic information must be entered into the data registration means even though it is already written in the patent gazette text, which is laborious. In addition, if, for example, the publication number is registered incorrectly, the entered publication number differs from the publication number written in the patent gazette text entered with it, resulting in an erroneous registration.
The present invention has been made in view of these unsolved problems of the prior art. Its object is to provide a document management apparatus, a document management method, and a program that, when registering in a database a document whose body contains attribute information characterizing that document, can register the document and the attribute information in association with each other without the attribute information being entered separately.

To solve the above problems, the document management apparatus according to claim 1 of the present invention is a document management apparatus comprising a document input unit that receives a document whose body contains attribute information characterizing the document, and a database that manages the document and the attribute information in association with each other, the apparatus further comprising: an attribute information extraction unit that extracts the attribute information from the document received by the document input unit; and a document information registration unit that registers, in the database, the attribute information extracted by the attribute information extraction unit in association with the input document. The attribute information extraction unit extracts the attribute information either on the basis of an attribute information specifying keyword, which is a keyword written in the input document that identifies where the attribute information is described, or on the basis of the frequency of occurrence of character strings matching an attribute information expression pattern, which is an expression pattern written in the input document that expresses the attribute information.

According to the invention of claim 1, the document management apparatus extracts the attribute information either on the basis of the attribute information specifying keyword written in the input document, which identifies where the attribute information is described, or on the basis of the frequency of occurrence of character strings matching the attribute information expression pattern, and registers the extracted attribute information in the database in association with the input document. Simply supplying the document to the document management apparatus is therefore enough to register the attribute information and the document in the database in association with each other. The user no longer needs to specify or enter the attribute information of the document, which simplifies registration and prevents erroneous registration of the attribute information.

The document management apparatus according to claim 2 is the apparatus of claim 1, further comprising a region estimation parameter management unit that manages region estimation parameters for estimating the region of the document in which the character string appears. When extracting the attribute information on the basis of the frequency of occurrence of the character string, the attribute information extraction unit estimates the appearance region using the region estimation parameters managed by the region estimation parameter management unit, and determines the frequency of occurrence of the character string by searching for it within the estimated appearance region.

According to the invention of claim 2, the appearance region is estimated using the region estimation parameters managed by the region estimation parameter management unit, and the frequency of occurrence of the character string is determined by searching for it within the estimated region. Restricting the search to the region where a character string representing the attribute information is likely to appear makes it possible to raise the frequency of occurrence of that character string and to perform highly accurate attribute extraction.

The document management method according to claim 3 is a document management method executed by a document management apparatus that manages, in a database, a document in association with attribute information characterizing the document, the method comprising: a document input step of receiving the document in which the attribute information is described; an attribute information extraction step of extracting the attribute information on the basis of an attribute information specifying keyword, which identifies where the attribute information is described, when that keyword is written in the input document, and of extracting the attribute information on the basis of the frequency of occurrence of character strings matching an attribute information expression pattern, which expresses the attribute information, when the keyword is not written in the input document and the pattern is; and a document information registration step of registering, in the database, the attribute information extracted in the attribute information extraction step in association with the input document.

According to the invention of claim 3, the document management apparatus extracts the attribute information on the basis of the attribute information specifying keyword when that keyword is written in the input document, extracts the attribute information on the basis of the frequency of occurrence of character strings matching the attribute information expression pattern when the keyword is not written in the input document and the pattern is, and registers the extracted attribute information in the database in association with the input document. Simply supplying the document, without entering its attribute information, is therefore enough to register and manage the attribute information and the document in the database in association with each other.

The document management method according to claim 4 is the method of claim 3, further comprising a region estimation parameter setting step of setting region estimation parameters for estimating the region of the input document in which the character string appears. In the attribute information extraction step, when the attribute information is extracted on the basis of the frequency of occurrence of the character string, the appearance region is estimated using the region estimation parameters set in the region estimation parameter setting step, and the frequency of occurrence of the character string is determined by searching for it within the estimated appearance region.

According to the invention of claim 4, when the attribute information is extracted on the basis of the frequency of occurrence of the character string in the attribute information extraction step, the appearance region is estimated using the region estimation parameters set in the region estimation parameter setting step, and the frequency of occurrence of the character string is determined by searching for it within the estimated region. Restricting the search to the region where a character string representing the attribute information is likely to appear makes it possible to raise the frequency of occurrence of that character string and to perform highly accurate attribute extraction.

The program according to claim 5 is stored in a computer, by download from a server or by copying from a recording medium, and executed, whereby the method according to claim 3 or 4 can be implemented by the computer.

According to the present invention, the document management apparatus extracts the attribute information either on the basis of the attribute information specifying keyword written in the input document, which identifies where the attribute information is described, or on the basis of the frequency of occurrence of character strings matching the attribute information expression pattern, which expresses the attribute information, and registers the extracted attribute information in the database in association with the input document. Simply supplying the document to the document management apparatus is therefore enough to register the attribute information and the document in the database in association with each other. The user no longer needs to specify or enter the attribute information of the document, which simplifies registration and prevents erroneous registration of the attribute information.

FIG. 1 is a diagram showing the configuration of the document management apparatus according to Example 1 of the present invention.
FIG. 2 is a flowchart showing the flow of the determination method executed by the input document type determination unit of the document management apparatus according to Example 1 of the present invention.
FIG. 3 is a diagram showing an example of the database schema of the document management apparatus according to Example 1 of the present invention.
FIG. 4 is a diagram showing the configuration of the document management apparatus according to Example 2 of the present invention.
FIG. 5 is a diagram showing an example of the publication number format pattern of a patent gazette file input to the document management apparatus according to Example 2 of the present invention.
FIG. 6 is a diagram showing the configuration of the document management apparatus according to Example 3 of the present invention.
FIG. 7 is a flowchart showing the flow of the method, executed by the attribute extraction unit according to Example 3 of the present invention, for determining the frequency of occurrence of character strings.
FIG. 8 is a diagram for explaining a conventional, general data registration method.
FIG. 9 is a diagram showing an example of patent information input data.
FIG. 10 is a diagram showing an example of a database schema for managing patent information.

An embodiment of the present invention is described below with reference to the drawings, taking the case where the document is a patent gazette file 100. In the drawings referred to in the following description, parts equivalent to those in other drawings are denoted by the same reference numerals.

FIG. 1 is a configuration diagram of the document management apparatus according to Example 1. As shown in the figure, the document management apparatus of the present invention comprises a document input unit 102, an input document type determination unit 104, an attribute extraction unit 106, a document information registration unit 108, a database schema 110, and a database 112. These functions are realized when a CPU (Central Processing Unit), not shown, of the document management apparatus executes a program stored in a storage device such as a hard disk or ROM (Read Only Memory).

A patent gazette file 100 is input to the document input unit 102. Here, the patent gazette file 100 is an electronic patent gazette text file or an electronic patent gazette PDF file made publicly available by the IPDL (特許電子図書館, the Industrial Property Digital Library).
This example describes the case where a patent gazette file 100 written in text format is input.
The input document type determination unit 104 determines the file type of the input patent gazette file 100, that is, whether it is a text file or a PDF file. It then stores the determination result in a storage area provided in advance in the storage device for holding the input document type.

FIG. 2 is a flowchart showing the flow of the determination method performed by the input document type determination unit 104. The input document type determination unit 104 first starts processing at S200 and moves to S210. At S210, it determines whether a file type determination tool can be used. The file type determination tool is not something provided in advance within the document management apparatus; it is a tool prepared separately from the apparatus and used by it. A typical file type determination tool is the file command that ships with UNIX (registered trademark) operating systems. If the tool can be used, processing moves to S230, where the tool reads the patent gazette file 100. The tool's determination result is set as the input document type, which is the final result of the input document type determination unit 104, and processing moves to S260 and ends.

If it is determined at S210 that the file type determination tool cannot be used, processing moves to S220. At S220, it is determined whether the file name extension of the patent gazette file 100 is “pdf”. If the extension is “pdf”, processing moves to S240. At S240, the input document type is set to “pdf”, and processing moves to S260 and ends.

If the file name extension is not “pdf” at S220, processing moves to S250. At S250, the patent gazette file 100 is judged to be a text file, and the input document type is set to “txt”. Processing then moves to S260 and ends.
Based on the input document type obtained by the input document type determination unit 104, the attribute extraction unit 106 extracts from the patent gazette file 100 the attribute information described within it.
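The determination flow of FIG. 2 (S200 to S260) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the mapping from the MIME type reported by the file command to “pdf” or “txt” is an assumption, and falling back to the extension check when the tool gives no usable answer is an added safeguard that the text does not describe.

```python
import os
import shutil
import subprocess


def type_from_extension(path):
    """S220 to S250: 'pdf' if the extension is pdf, otherwise treat as text."""
    ext = os.path.splitext(path)[1].lower()  # case-insensitive match is an assumption
    return "pdf" if ext == ".pdf" else "txt"


def determine_input_document_type(path, tool_name="file"):
    """Return 'pdf' or 'txt' for the given file path.

    tool_name is the external file type determination tool; pass None to
    skip the tool and go straight to the extension check.
    """
    # S210: can an external file type determination tool be used?
    tool = shutil.which(tool_name) if tool_name else None
    if tool is not None:
        # S230: let the tool read the file and adopt its result.
        try:
            mime = subprocess.run(
                [tool, "--brief", "--mime-type", path],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            mime = ""
        if mime == "application/pdf":
            return "pdf"
        if mime.startswith("text/"):
            return "txt"
        # Assumption: no usable answer from the tool, so fall through.
    return type_from_extension(path)


if __name__ == "__main__":
    # Hypothetical file name; with tool_name=None only the extension is used.
    print(determine_input_document_type("gazette.pdf", tool_name=None))
```

Keeping the extension check in its own function makes the two branches of the flowchart easy to exercise separately, whether or not the external tool is installed.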

In this example, the application number, the publication number, and the title of the invention are extracted as attribute information.
When the input document type is “txt”, the full text, including the bibliographic information, is written in a machine-processable text format, and the application number, publication number, and title of the invention can each be extracted by using the corresponding label enclosed in lenticular brackets, 【出願番号】, 【公開番号】, and 【発明の名称】, as the attribute information specifying keyword. For each item of attribute information, the attribute extraction unit 106 extracts, as the attribute information, the span from the end of the corresponding attribute information specifying keyword to the end of the line marked by a line feed code.
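The keyword-based extraction described above amounts to finding each bracketed label and taking the rest of that line. A minimal Python sketch follows; the dictionary keys, the function name, and the sample text are hypothetical, and stripping surrounding whitespace from the extracted value is an assumption.

```python
# Attribute information specifying keywords: bracketed labels in the gazette text.
# The English dictionary keys are names chosen for this sketch only.
KEYWORDS = {
    "application_number": "【出願番号】",
    "publication_number": "【公開番号】",
    "title_of_invention": "【発明の名称】",
}


def extract_attributes(text):
    """Return the text from the end of each keyword up to the next line feed."""
    attributes = {}
    for name, keyword in KEYWORDS.items():
        pos = text.find(keyword)
        if pos == -1:
            continue  # keyword absent: this attribute is not extracted
        start = pos + len(keyword)
        end = text.find("\n", start)
        if end == -1:
            end = len(text)  # keyword is on the last line
        attributes[name] = text[start:end].strip()
    return attributes


if __name__ == "__main__":
    # Made-up sample text, not taken from a real gazette.
    sample = (
        "【公開番号】特開0000-000000\n"
        "【出願番号】特願0000-000001\n"
        "【発明の名称】サンプル\n"
        "本文\n"
    )
    print(extract_attributes(sample))
```

Only the first occurrence of each keyword is used here; how repeated keywords should be handled is not specified in the text.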

From the application number, publication number, and title of the invention extracted by the attribute extraction unit 106, the document information registration unit 108 creates data conforming to the database table shown in FIG. 3, which is defined in advance in the database schema 110, and registers it in the database 112.
As this example shows, when registering the patent gazette file 100 in the database 112, the user does not need to specify or enter the application number, publication number, and title of the invention, which are the attribute information of the patent gazette file 100, separately from the patent gazette file itself. The present invention thus makes it possible to realize a document management apparatus in which the registration process is simple and erroneous registration of the attribute information does not occur.

FIG. 4 is a configuration diagram of the document management apparatus according to Example 2.
The document management apparatus according to Example 2 differs from that of Example 1 in that the patent gazette file 100 is a PDF file and in the processing performed by the attribute extraction unit 206. In all other respects it is the same as Example 1.
In a patent gazette file 100 in PDF format, the bibliographic information is embedded as image data. Even when the PDF file is converted into machine-processable text, the bibliographic information therefore cannot be extracted as machine-processable text.

In the patent publication file 100 in pdf format, the publication number is written in the header portion of each page. When a pdf-format patent publication file 100 is input, the document management apparatus according to this embodiment extracts the publication number written in the header portion as attribute information and registers data corresponding to the patent publication file table of the database schema 110 in the database 112.

Specifically, the attribute extraction unit 206 reads the text information obtained by converting the pdf-format patent publication file 100 from beginning to end, and finds the character strings that match the publication number format patterns shown in FIG. 5 together with their appearance frequencies. FIG. 5 shows the publication number format patterns written as regular expressions. FIG. 5(A) represents a pattern that begins with "特開", followed by either "平" or "昭", and then by a digit string, "−", and a digit string. FIG. 5(B) represents a pattern that begins with "特開", followed by a digit string of four digits, "−", and a digit string. From the character strings found and their appearance frequencies, the attribute extraction unit 206 determines the character string with the highest appearance frequency to be the publication number of the patent publication file 100.
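The frequency-based selection can be sketched as below. FIG. 5 itself is not reproduced in this text, so the two regular expressions are reconstructions from the prose description, and the set of accepted hyphen characters is an assumption; `most_frequent_publication_number` is a hypothetical name.

```python
import re
from collections import Counter

# Accept several hyphen-like characters, since converted pdf text may vary (assumption).
HYPHEN = "[-−－]"
# Reconstruction of FIG. 5(A): "特開", then "平" or "昭", digits, hyphen, digits.
PATTERN_A = rf"特開[平昭]\d+{HYPHEN}\d+"
# Reconstruction of FIG. 5(B): "特開", four digits, hyphen, digits.
PATTERN_B = rf"特開\d{{4}}{HYPHEN}\d+"
PUBLICATION_NUMBER = re.compile(f"{PATTERN_A}|{PATTERN_B}")


def most_frequent_publication_number(text):
    """Return the matching string that occurs most often, or None if none match."""
    counts = Counter(PUBLICATION_NUMBER.findall(text))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

This sketch does not handle ties: if two strings share the highest count, it returns one of them without signaling the ambiguity.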

In addition to its own publication number, the patent publication file 100 contains the publication numbers of prior art documents, and the attribute extraction unit 206 extracts those as well. However, the publication number of the patent publication file 100 is written in the header portion of every page, so it generally appears more often than the publication number of any prior art document. By selecting the character string with the highest appearance frequency, the attribute extraction unit 206 can correctly extract the publication number of the patent publication file 100.

FIG. 6 is a configuration diagram of the document management apparatus according to the third embodiment.
The third embodiment differs from the second embodiment in the operation of the attribute extraction unit 306 and in the addition of an area estimation parameter management unit 314. In all other respects it is the same as the second embodiment.
The area estimation parameter management unit 314 manages a value corresponding to the number of lines on one page of the patent publication file 100. In this embodiment, the initial value is set to "58".
Unlike the attribute extraction unit 206 of the second embodiment, the attribute extraction unit 306 does not search the full text of the patent publication file 100. Instead, it searches for character strings matching the publication number format patterns shown in FIG. 5 only within specific appearance areas where the publication number is estimated to appear.

The reason is as follows. The attribute extraction unit 206 of the second embodiment uses the full text of the input patent publication file 100 as the search area. If the patent publication file 100 is short, the appearance frequency of a prior art document's publication number may equal that of the file's own publication number. In that case several character strings share the highest appearance frequency, and the publication number cannot be determined. The attribute extraction unit 306 of this embodiment avoids this ambiguity by limiting the search to appearance areas where the publication number is likely to appear.

When the patent publication file 100 is a pdf file, the publication number is written in the header portion of each page, and this publication number is extracted as the attribute information characterizing the input file. The publication numbers in the header portions appear almost regularly, separated by the number of lines on one page. In this embodiment, the areas where the publication number appears regularly are identified as appearance areas and only those areas are searched, which resolves the problem of several character strings sharing the highest appearance frequency.

FIG. 7 is a flowchart showing how the attribute extraction unit 306 determines the appearance frequency of character strings. The attribute extraction unit 306 starts processing in S1000 and proceeds to S1010.
In S1010, the search area for the first search is set and the parameters used throughout the process are initialized. Here, start_position and end_position are the search start position and search end position that define the search area. end_position = −1 means that the search area extends to the end of the input patent publication file 100. previous_found_position is the position of the character string found in the previous search, and areasize is half the size of the search area. The position of a character string is the line number, counted from the beginning of the patent publication file 100, of the line in which the string appears.

In S1020, the defined search area is searched for a character string matching the publication number format patterns shown in FIG. 5. The search starts from the beginning of the search area. As soon as a matching character string is found, the processing in S1020 ends and the process proceeds to S1030.
In S1030, it is determined whether a character string was found. If not, the process proceeds to S1040; if so, the process proceeds to S1070.

In S1070, found_position is set to the position of the found character string, 1 is added to the appearance frequency of that character string, and the process proceeds to S1080.
In S1080, whether start_position is 0 is checked in order to determine whether the found character string is the very first one found. If start_position is 0, the character string is judged to be the first one found, and the process proceeds to S1100. If start_position is not 0, the process proceeds to S1090.

In S1090, the distance from the place where a character string matching the publication number format patterns shown in FIG. 5 was previously found to the place where one was found this time is calculated as found_position − previous_found_position. Page line count data indicating this distance is passed to the area estimation parameter management unit 314, and the process proceeds to S1100. The area estimation parameter management unit 314 manages the page line count data, which indicates the distance between character strings matching the publication number format patterns shown in FIG. 5.

In S1100, previous_found_position is set to the value of found_position for use in the next search, and the process proceeds to S1110.
In S1110, the page line count data is obtained from the area estimation parameter management unit 314 and assigned to numlines, and the process proceeds to S1120.
In S1120, the search area for the next search is specified, and the process returns to S1020. The search area is determined by start_position = found_position + numlines − areasize and end_position = found_position + numlines + areasize. In other words, the area extending areasize lines before and after the position numlines lines away from the place where a matching character string was last found is estimated to be an appearance area where the character string is likely to appear, and this appearance area becomes the search area.

On the other hand, if no character string was found in the search area in S1030, S1040 checks whether end_position is −1 in order to determine whether the entire patent publication file 100 has been read. If end_position is −1, the entire patent publication file 100 is judged to have been read, the process proceeds to S1050, and the processing of the attribute extraction unit 306 ends. If end_position is not −1, the end of the search area is set to the end of the patent publication file 100, and the process returns to S1020 and repeats.
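The flow of FIG. 7 (S1010 to S1120) can be sketched as a single loop. This is an illustrative reading under several assumptions not stated in the text: the value of `areasize`, treating end_position as inclusive, replacing numlines with the most recently measured distance (the text does not say how the area estimation parameter management unit 314 combines measurements), the reconstructed regular expression, and a guard that keeps the next window after the last hit. The function name `count_in_estimated_areas` is hypothetical.

```python
import re
from collections import Counter

# Reconstruction of the FIG. 5 patterns from the prose description (assumption).
PATTERN = re.compile(r"特開(?:[平昭]\d+|\d{4})[-−－]\d+")


def count_in_estimated_areas(lines, initial_numlines=58, areasize=3):
    """Count pattern hits, searching only near where the next page header is expected."""
    # S1010: initial search area is the whole file
    start_position, end_position = 0, -1
    previous_found_position = 0
    numlines = initial_numlines  # stands in for the parameter management unit 314
    counts = Counter()
    while True:
        # S1020: search the current area from its start, stop at the first hit
        if end_position == -1:
            stop = len(lines)
        else:
            stop = min(end_position + 1, len(lines))  # end treated as inclusive
        found = None
        for position in range(start_position, stop):
            match = PATTERN.search(lines[position])
            if match:
                found = (position, match.group())
                break
        if found is None:  # S1030 -> S1040
            if end_position == -1:
                return counts  # S1050: whole file has been read
            end_position = -1  # widen the area to the end of the file
            continue
        found_position, text = found  # S1070
        counts[text] += 1
        if start_position != 0:  # S1080 -> S1090: update the page line count
            numlines = found_position - previous_found_position
        previous_found_position = found_position  # S1100
        # S1110/S1120: next window; max() is an added guard against re-finding the hit
        start_position = max(found_position + 1, found_position + numlines - areasize)
        end_position = found_position + numlines + areasize
```

With 58-line pages, prior art numbers cited in the body fall outside the windows around lines 58, 116, and so on, so only the header occurrences are counted.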

As described above, the document management apparatus can automatically extract attribute information, such as the publication number, that is written inside the input patent publication file 100. Simply inputting the patent publication file 100 into the document management apparatus is therefore enough to register and manage the patent publication file 100 and its attribute information in association with each other in the database 112. The user no longer needs to specify or input the attribute information, so registration is simplified and erroneous registration of attribute information can be prevented.

The attribute information extraction methods described in the first and second embodiments, or in the first and third embodiments, can also be combined. For example, when the patent publication file 100 contains an attribute information specifying keyword that identifies where the attribute information is written, the publication number is extracted based on that keyword. When the patent publication file 100 does not contain the attribute information specifying keyword but does contain an attribute information expression pattern such as the publication number format pattern, the attribute information may be extracted based on the appearance frequency of the character strings matching that pattern.
The embodiments above describe the case where the document is the patent publication file 100, but the document is not limited to this. Any document in which predetermined attribute information is written and which should be managed in association with that attribute information is conceivable, such as contracts, application forms, medical records, and collections of papers.
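The combination of methods mentioned above (keyword first, pattern frequency as a fallback) might look like the following sketch. The keyword string 【公開番号】, the regular expression, and the function name `extract_publication_number` are assumptions introduced for illustration.

```python
import re
from collections import Counter

KEYWORD = "【公開番号】"  # assumed corner-bracketed label
# Reconstruction of the FIG. 5 patterns from the prose description (assumption).
PATTERN = re.compile(r"特開(?:[平昭]\d+|\d{4})[-−－]\d+")


def extract_publication_number(text):
    """Use the keyword when present; otherwise fall back to pattern frequency."""
    start = text.find(KEYWORD)
    if start != -1:
        start += len(KEYWORD)
        end = text.find("\n", start)
        if end == -1:
            end = len(text)
        return text[start:end].strip()
    counts = Counter(PATTERN.findall(text))
    return counts.most_common(1)[0][0] if counts else None
```

The same dispatch shape would apply to other document types, with the keyword and pattern swapped for ones suited to that document.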

100 Patent publication file
102 Document input unit
104 Input document type determination unit
106 Attribute extraction unit
108 Document information registration unit
110 Database schema
112 Database
206 Attribute extraction unit
306 Attribute extraction unit
314 Area estimation parameter management unit

Claims

A document management apparatus comprising a document input unit that inputs a document in which attribute information characterizing the document is described, and a database that manages the document in association with the attribute information, the apparatus further comprising:
an attribute information extraction unit that extracts the attribute information from the document input by the document input unit; and
a document information registration unit that associates the attribute information extracted by the attribute information extraction unit with the input document and registers them in the database,
wherein the attribute information extraction unit extracts the attribute information
based on an attribute information specifying keyword, which is a keyword described in the input document that specifies the location where the attribute information is described, or
based on the appearance frequency of a character string corresponding to an attribute information expression pattern, which is an expression pattern described in the input document that expresses the attribute information.

The document management apparatus according to claim 1, further comprising an area estimation parameter management unit that manages an area estimation parameter for estimating an appearance area in which the character string appears in the document,
wherein, when extracting the attribute information based on the appearance frequency of the character string, the attribute information extraction unit
determines the appearance frequency of the character string by estimating the appearance area using the area estimation parameter managed by the area estimation parameter management unit and searching for the character string in the estimated appearance area.

A document management method executed by a document management apparatus that manages, in a database, a document in association with attribute information characterizing the document, the method comprising:
a document input step of inputting the document in which the attribute information is described;
an attribute information extraction step of extracting the attribute information based on an attribute information specifying keyword that specifies the location where the attribute information is described, when that keyword is described in the input document, and,
when the attribute information specifying keyword is not described in the input document and an attribute information expression pattern expressing the attribute information is described, extracting the attribute information based on the appearance frequency of a character string corresponding to the attribute information expression pattern; and
a document information registration step of registering the attribute information extracted in the attribute information extraction step and the input document in association with each other in the database.

The document management method according to claim 3, further comprising an area estimation parameter setting step of setting an area estimation parameter for estimating an appearance area in which the character string appears in the input document,
wherein, in the attribute information extraction step, when the attribute information is extracted based on the appearance frequency of the character string, the appearance area is estimated using the area estimation parameter set in the area estimation parameter setting step, and the appearance frequency of the character string is determined by searching for the character string in the estimated appearance area.

A program for causing a computer to execute the method according to claim 3 or 4.