JP2004094648A

JP2004094648A - Required file prediction method and system, and required file prediction program

Info

Publication number: JP2004094648A
Application number: JP2002255563A
Authority: JP
Inventors: Atsushi Futakata; 二方　厚志
Original assignee: Central Research Institute of Electric Power Industry
Current assignee: Central Research Institute of Electric Power Industry
Priority date: 2002-08-30
Filing date: 2002-08-30
Publication date: 2004-03-25

Abstract

[Problem] To predict the files that a user needs.
[Solution] A system 1 includes: a file usage history management means 2; a file itemization means 3 that, for the files appearing in the file usage history, groups files having equal attribute values into a single file item and thereby converts the file usage history into file items; a context generation means 4 that generates a plurality of contexts, each being a combination of attribute values; a rule extraction means 5 that extracts association rules holding between file items, in correspondence with each of the contexts; a work situation estimation means 6 that, based on the file usage status of a target user, selects one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and an information providing means 7 that applies the association rules to identify predicted file items correlated with the file items for prediction and makes the files belonging to the predicted file items accessible to the target user.
[Selected drawing] Fig. 1

Description

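[0001]
[Technical field of the invention]
The present invention relates to a method, a system, and a program for predicting the files that a target user needs. More specifically, it relates to a method, a system, and a program in which a data mining technique for extracting association rules is applied to the prediction of required files.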
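[0002]
[Prior art]
With the spread of personal computers and the Internet, and of information systems using communication network technology within corporate organizations, the digitization of information such as papers, documents, and charts has advanced rapidly. As a result, a wide variety of information resides on computers inside and outside the organization, and the information needed for a task can often be obtained over a network.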
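[0003]
Conventional technology that helps users of information (hereinafter also simply called users) acquire information is typified by Internet search engines. Given keywords describing the characteristics of the information the user needs, a search engine provides links to the relevant web pages. In doing so, it applies refinements such as omitting duplicate pages and ranking results in descending order of the likelihood that the user needs them.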
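[0004]
Web Usage Mining is a technology with which administrators of information systems and websites, as suppliers of information, grasp users' intentions and improve their systems to make them more convenient (reference: J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, No. 2, 2000, among others). Web Usage Mining is a kind of data mining that discovers, from the access log of a site's pages recorded on the web server, association rules such as "users who viewed page A also tend strongly to view page B". Its results are used to improve the site structure, such as page content and the link structure between pages. For example, if the pattern "a user who viewed page A also views page B" occurs frequently, it can be inferred that many users need the information on page A and on page B at the same time. If page B cannot be reached directly by a link from page A, adding a link between A and B, together with appropriate guidance, makes it easier for users to obtain the information they need.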
[0005]
Also proposed in the past are systems that autonomously collect information according to the user's preferences (reference: O. Etzioni: Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web, In Proc. of 12th National Conference of Artificial Intelligence, 1996.) and systems that provide information to users based on collaborative filtering (reference: N. Good, J. B. Schafer, et al.: Combining Collaborative Filtering with Personal Agents for Better Recommendations, In Proc. of AAAI-99, 1999.).
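[0006]
[Problem to be solved by the invention]
However, even when information useful for a task exists on computers inside or outside the organization, few users are aware of where information they do not normally use is located. Consequently, when performing important but infrequent tasks (such as preparing a half-yearly business report) or new tasks (such as new duties after a transfer or preparing materials for a new project), users often do not know where to obtain the information they need.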
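[0007]
In information retrieval with a search engine, it is difficult to choose appropriate keywords, and because a large number of possibly relevant documents and charts are presented, users are often at a loss as to which information to select.
[0008]
Systems that autonomously collect information according to the user's preferences, and systems based on collaborative filtering, do not take the work situation into account and therefore often make off-target recommendations. In general, once the work situation is narrowed down, the kinds of information used in that situation and the patterns of their use can be identified to some extent. However, it is nearly impossible to enumerate every work situation that may arise and to define in advance the information required in each.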
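[0009]
To predict the files a user needs, one could apply Web Usage Mining to the file usage history and mine association rules such as "users who used file A also tend strongly to use file B". This approach, however, has the following problems.
[0010]
First, files are not reused very often in day-to-day work. Even when the same task is performed, different files with similar content may be used. Consequently, if individual files are simply treated as the objects of data mining, it is likely that no association rules of any generality will be extracted.
[0011]
Second, new files such as documents and charts are created every day, and extracting an association rule that involves such a new file requires waiting until that file is reused.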
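[0012]
Third, because tasks are carried out in parallel, harmful association rules that degrade prediction accuracy may be extracted. In day-to-day work, several tasks are often in progress at once. Consider, for example, a user who draws up a plan for a project starting next fiscal year while preparing materials for a meeting. Suppose that files A, B, C, and D are used during the meeting-material task P_1, that files X and Y are used during the project-planning task P_2, and that the files are used in the time order A→B→X→Y→C→D. The file usage history is then recorded in the order A→B→X→Y→C→D, and the set of prediction rules extracted from it will include rules between {A, B, C, D} and {X, Y} (for example, A→X). Such rules are useful when, as in the precedent, P_1 and P_2 are actually proceeding in parallel. When only P_1 is being performed, however, there is no need to predict X, and the rule becomes harmful in the sense that it degrades prediction accuracy.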
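[0013]
Fourth, association rules for important but infrequent tasks (for example, preparing quarterly reports, month-end progress reports, or next fiscal year's plans and budgets) are unlikely to be extracted because the files involved are used so rarely, even if the task is a high-priority one that recurs regularly without fail.
[0014]
An object of the present invention is therefore to provide a method, a system, and a program for appropriately predicting the files a user needs.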
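[0015]
[Means for solving the problem]
To achieve this object, the required file prediction method according to claim 1: defines in advance a plurality of attributes representing the characteristics of files and the attribute values each attribute can take; for the files appearing in a previously recorded file usage history, groups files having equal attribute values into a single file item and thereby converts the file usage history into file items; generates a plurality of contexts, each being a combination of attribute values; extracts, based on the itemized file usage history and in correspondence with each of the contexts, association rules that hold between file items; selects, based on the file usage status of a target user, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; applies the association rules corresponding to the contexts for prediction to identify predicted file items correlated with the file items for prediction; and predicts the files belonging to the predicted file items as the files the target user needs.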
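[0016]
The required file prediction system according to claim 7 includes: a file usage history management means that records a file usage history; a file itemization means that, based on a plurality of attributes representing the characteristics of files and the attribute values each attribute can take, groups files appearing in the file usage history that have equal attribute values into a single file item and thereby converts the file usage history into file items; a context generation means that generates a plurality of contexts, each being a combination of attribute values; a rule extraction means that extracts, based on the itemized file usage history and in correspondence with each of the contexts, association rules that hold between file items; a work situation estimation means that, based on the file usage status of a target user recorded by the file usage history management means, selects one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and an information providing means that applies the association rules corresponding to the contexts for prediction to identify predicted file items correlated with the file items for prediction and makes the files belonging to the predicted file items accessible to the target user.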
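[0017]
The required file prediction program according to claim 8 causes a computer to function as: a file usage history management means that records a file usage history; a file itemization means that, based on a plurality of attributes representing the characteristics of files and the attribute values each attribute can take, groups files appearing in the file usage history that have equal attribute values into a single file item and thereby converts the file usage history into file items; a context generation means that generates a plurality of contexts, each being a combination of attribute values; a rule extraction means that extracts, based on the itemized file usage history and in correspondence with each of the contexts, association rules that hold between file items; a work situation estimation means that, based on the file usage status of a target user recorded by the file usage history management means, selects one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and an information providing means that applies the association rules corresponding to the contexts for prediction to identify predicted file items correlated with the file items for prediction and makes the files belonging to the predicted file items accessible to the target user.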
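[0018]
Accordingly, files sharing common properties are consolidated into a single file item. File items are classified by context, and the association rules holding between file items are extracted for each context. Based on the work the user is performing, a context that approximates that work situation is selected, and the file items needed for the work are derived by the association rules. The user can then easily access the files matching those file items, that is, the files the user is presumed to need.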
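[0019]
The invention according to claim 2 is the required file prediction method according to claim 1, wherein the association rules are extracted such that the file items in the antecedent belong to the corresponding context. Thus, given the file items being used in a task approximated by the context, rules that predict the file item needed next can be extracted.
[0020]
The invention according to claim 3 is the required file prediction method according to claim 1, wherein the association rules are extracted such that both the file items in the antecedent and the file item in the consequent belong to the corresponding context. Compared with the invention of claim 2, this yields a rule set that fits the work situation approximated by the context more strictly but covers a narrower predictable range.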
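[0021]
The invention according to claim 4 is the required file prediction method according to any one of claims 1 to 3, wherein one or more file items for prediction are identified from the attribute values of the files appearing in the file usage status, and the single context is selected to which the file items for prediction belong and to which the fewest other file items belong. This minimizes the information assumed beyond the files the target user has recently used and reduces the provision of information the target user does not need. Because a single context is selected, this corresponds to the case where the target user is performing one task.
[0022]
The invention according to claim 5 is the required file prediction method according to any one of claims 1 to 3, wherein one or more file items for prediction are identified from the attribute values of the files appearing in the file usage status, and the combination of contexts is selected to which the file items for prediction belong and to which the fewest other file items belong. This minimizes the information assumed beyond the files the target user has recently used and reduces the provision of information the target user does not need. Because a plurality of contexts is selected, this corresponds to the case where the target user is performing several tasks in parallel.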
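[0023]
The invention according to claim 6 is the required file prediction method according to any one of claims 1 to 5, wherein the predicted file items are ranked in descending order of the probability that the applied association rule holds, and a predetermined number of predicted file items are selected from the top. This reduces the provision of information the target user does not need, and adjusting the top-n range of predicted file items selected makes it possible to control the quality and quantity of information to suit the target user's work situation.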
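[0024]
[Embodiments of the invention]
The configuration of the present invention is described in detail below based on the embodiment shown in the drawings.
[0025]
Figs. 1 to 4 show one embodiment of the required file prediction method of the present invention. This method: defines in advance a plurality of attributes representing the characteristics of files and the attribute values each attribute can take; for the files appearing in a previously recorded file usage history, groups files having equal attribute values into a single file item and thereby converts the file usage history into file items; generates a plurality of contexts, each being a combination of attribute values; extracts, based on the itemized file usage history and in correspondence with each of the contexts, association rules that hold between file items; selects, based on the file usage status of a target user, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; applies the association rules corresponding to the contexts for prediction to identify predicted file items correlated with the file items for prediction; and predicts the files belonging to the predicted file items as the files the target user needs.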
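[0026]
In this embodiment, the above method is implemented as a required file prediction system using one or more well-known computers, each equipped with, for example, a central processing unit (CPU), a main storage device, an external storage device, an output device such as a display, an input device such as a keyboard, and a communication interface to external information processing devices. As shown in Fig. 1, this required file prediction system 1 includes: a file usage history management means 2 that records a file usage history; a file itemization means 3 that, based on a plurality of attributes representing the characteristics of files and the attribute values each attribute can take, groups files appearing in the file usage history that have equal attribute values into a single file item and thereby converts the file usage history into file items; a context generation means 4 that generates a plurality of contexts, each being a combination of attribute values; a rule extraction means 5 that extracts, based on the itemized file usage history and in correspondence with each of the contexts generated by the context generation means 4, association rules that hold between file items; a work situation estimation means 6 that, based on the file usage status of a target user recorded by the file usage history management means 2, selects one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and an information providing means 7 that applies the association rules corresponding to the contexts for prediction to identify predicted file items correlated with the file items for prediction and makes the files belonging to the predicted file items accessible to the target user. The required file prediction program of the present invention, when loaded into and executed by a computer, causes that computer to function as the required file prediction system 1. Reference numeral 8 in Fig. 1 denotes an information processing terminal (for example, a personal computer) belonging to a user. Although Fig. 1 shows a single user terminal 8, the system 1 can be used by multiple users, and there may of course be multiple user terminals 8.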
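[0027]
A file in this embodiment means a file as electronic data. The kind of file is not particularly limited; any type of file usable on a computer, such as documents, tables, images, programs, and HTML, can be covered. Because the files in this embodiment are electronic data that a computer can manage, the access log for the files can be recorded automatically.
[0028]
In this embodiment, for example, the program file spy included in the resource kit for Windows 2000 (a trademark of Microsoft Corporation) is used to make a computer function as the file usage history management means 2. file spy is a kind of IFS (Installable File System) filter driver that monitors local and network drives and can record I/O request packets. Because file spy also records the disk I/O that occurs when ordinary applications start and the disk I/O that the OS performs from time to time, its output becomes very large. In this embodiment, therefore, the directory containing the files the user works with is specified, and only accesses to files under that directory are filtered and recorded. The file usage history management means 2 is not, however, limited to this example. The file usage history may contain file accesses from many user terminals 8 connected over a network. To extract only the file accesses of a particular user from the history, information identifying the user or the user terminal 8, such as an IP address or a user ID, can be used.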
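[0029]
As attributes representing the characteristics of a file (in other words, of the work related to the file), this embodiment defines, for example, three attributes: "period", "type", and "task". The number of attributes is not limited to three, nor are their contents limited to this example; they can be set freely to suit how users work. An attribute in this embodiment is, so to speak, a container for an attribute value: the attributes are common to all files, while the values they hold can differ from file to file.
[0030]
Each attribute in this embodiment has the following meaning. The attribute "period" represents the period in which the file is mainly used; examples of its values are "H13" and "first half". The attribute "type" represents the role or purpose the file serves in the work, the kind of file, and so on; examples of its values are "plan", "result", "memo", "material", "program", and "report". The attribute "task" represents a word that identifies the task the user performs; examples of its values are a research theme name, a project name, or the code of a product under development.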
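[0031]
The file itemization means 3 treats files appearing in the file usage history that have the same attribute values as identical and recognizes each such group of files as a file item. Files sharing common properties are thereby consolidated into a single file item. The prediction of required information is performed on file items, and the files matching the predicted file items are provided to the user. Each file item in this embodiment is expressed as a tuple of attribute values. For example, the files with "period" = "fiscal year Heisei 14", "type" = "plan", and "task" = "case 1" are consolidated into the file item expressed as (H14, plan, case 1).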
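The grouping of files into file items can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `itemize`, the input layout (a mapping from file path to an already determined (period, type, task) tuple), and the file names are all hypothetical and do not come from the specification.

```python
from collections import defaultdict

def itemize(file_attrs):
    """Group files whose (period, type, task) values are equal into one file item."""
    items = defaultdict(list)
    for path, attrs in file_attrs.items():
        items[attrs].append(path)
    return dict(items)

# Hypothetical files; the attribute tuple follows the (H14, plan, case 1) example.
files = {
    "plan_a.doc": ("H14", "plan", "case 1"),
    "plan_b.doc": ("H14", "plan", "case 1"),
    "memo.txt": ("H14", "memo", "case 1"),
}
file_items = itemize(files)
```

Keying the dictionary on the attribute tuple means that a newly created file with known attribute values falls into an existing file item without any further bookkeeping.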
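[0032]
Methods for determining the attribute values of a file include, for example: (1) finding keywords that correspond to predetermined attribute values in the file name or among the words appearing in the file; (2) clustering documents by meaning and using the word corresponding to each cluster as the attribute value; and (3) classifying files beforehand and determining attribute values with a decision tree. Alternatively, a user, system administrator, or the like may determine the attribute values of files and enter them into the computer serving as the file itemization means 3.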
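[0033]
This embodiment adopts, for example, method (1) above, and the file itemization means 3 determines the attribute values of the files appearing in the file usage history. An example of the attribute value determination process in this embodiment is described with reference to the flowchart in Fig. 2. First, the "type" value of files with particular extensions is determined (step 101). For example, files with the extension ".c" or ".bat" are given "type" = "program", files with the extension ".dat" are given "type" = "data", and files with the extension ".html" are given "type" = "HTML". Next, the file name is searched for predetermined keywords to determine the values of "period", "type", and "task" (step 102). For example, the file "平成13年度研究計画.doc" (fiscal year Heisei 13 research plan) is given "period" = "H13" and "type" = "plan". If the "type" value has already been determined in step 101, it is left unchanged. Next, the file is converted to text, and the first 10 lines of the file are matched against keywords prepared in advance to determine the values of "period", "type", and "task" (step 103). Only the first 10 lines are used because the title, date, purpose, and the like are usually found at the beginning of a file. Any value of "period", "type", or "task" already determined by step 102 is left unchanged. In steps 102 and 103, it is preferable to unify synonyms and near-synonyms using a table or the like. For example, if the words "EMS", "environmental management system", or "environment" appear, "task" = "EMS" is set, and if the words "プラン" (plan) or "計画" (plan) appear, "type" = "plan" is set. If the attribute "period" has not been determined by the processing so far, it is determined from the time stamp (step 104). If the attribute "type" has not been determined, "type" = "memo" is set (step 105). If the attribute "task" has not been determined, "task" = "other" is set (step 106).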
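Steps 101 to 106 of Fig. 2 can be sketched in Python roughly as follows. This is a hedged illustration, not the embodiment's code: the tables `EXT_TYPE` and `KEYWORDS`, the function names, and the English keywords are assumptions made for the example, and the mapping from a time stamp to a "period" value (step 104) is left to the caller because the text does not specify it.

```python
import os

# Hypothetical lookup tables; the extensions follow the examples in the text.
EXT_TYPE = {".c": "program", ".bat": "program", ".dat": "data", ".html": "HTML"}
KEYWORDS = {
    "period": {"H13": ["H13"], "H12": ["H12"]},
    "type": {"plan": ["plan"]},
    "task": {"EMS": ["EMS", "environmental management system"]},
}

def match_keywords(text, attrs):
    """Fill in only the attributes that are still undecided (None)."""
    for attr, table in KEYWORDS.items():
        if attrs[attr] is not None:
            continue  # earlier steps take precedence
        for value, words in table.items():
            if any(word in text for word in words):
                attrs[attr] = value
                break

def determine_attributes(filename, text_lines, timestamp_period):
    attrs = {"period": None, "type": None, "task": None}
    ext = os.path.splitext(filename)[1].lower()
    attrs["type"] = EXT_TYPE.get(ext)                    # step 101
    match_keywords(filename, attrs)                      # step 102
    match_keywords("\n".join(text_lines[:10]), attrs)    # step 103
    if attrs["period"] is None:
        attrs["period"] = timestamp_period               # step 104
    if attrs["type"] is None:
        attrs["type"] = "memo"                           # step 105
    if attrs["task"] is None:
        attrs["task"] = "other"                          # step 106
    return (attrs["period"], attrs["type"], attrs["task"])
```

For instance, `determine_attributes("H13_research_plan.doc", [], "H12")` takes "period" and "type" from the file name and falls back to "other" for "task". The synonym table is where variants such as "EMS" and "environmental management system" are unified into one attribute value.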
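[0034]
A context is a combination of attribute values used to classify file items. Through its combination of attribute values, a context serves both to classify file items and to select the group of transactions to be mined. Introducing contexts makes it possible to capture accurately the file usage patterns that differ from one work situation to another. In this embodiment, for example, a context is expressed by combining attribute values of file items with at least one empty set φ. For example, the context that specifies "period" = "H14" is expressed as follows.
[0035]
[Equation 1]
X_1 = ⟨{H14}, φ, φ⟩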
[0036]
Likewise, the context that specifies "task" = "project 1" is expressed as follows.
[0037]
[Equation 2]
X_2 = ⟨φ, φ, {project 1}⟩
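[0038]
Further, a context that specifies two attribute values at once, such as "period" = "H14" and "task" = "project 1", is expressed as follows.
[0039]
[Equation 3]
X_3 = ⟨{H14}, φ, {project 1}⟩
[0040]
Comparing the context X_3, which specifies two attribute values, with the contexts X_1 and X_2, which specify only one, the set of file items classified by X_1 or by X_2 always contains the set of file items classified by X_3. This inclusion relation establishes an upper/lower relationship between contexts. In the example above, X_1 and X_2 are upper contexts of X_3. The higher a context, the wider the range of work situations it fits; the lower a context, the more strictly it represents a particular work situation.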
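[0041]
A context can be defined formally as follows. Let D = {d_1, ..., d_n} be a set of file items, where each file item d_i (i = 1 to n) is expressed by the following equation. Here a_i is the value of "period", b_i the value of "type", and c_i the value of "task".
[0042]
[Equation 4]
d_i = (a_i, b_i, c_i)
[0043]
The minimal context X for the set of file items D is then defined by the following equation.
[0044]
[Equation 5]
X = ⟨∩_i {a_i}, ∩_i {b_i}, ∩_i {c_i}⟩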
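[0045]
The attribute values specified by a context are the values that all file items in D have in common, and φ indicates that there is no common attribute value. In this embodiment, "file item d = (a, b, c) conforms to context X = ⟨A, B, C⟩" means that the following equation holds.
[0046]
[Equation 6]
X = ⟨A∩{a}, B∩{b}, C∩{c}⟩
[0047]
That is, each of A, B, and C is either φ or equal to {a}, {b}, and {c}, respectively. In this embodiment, when every element of a set of file items D conforms to context X, the set D is said to conform to context X.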
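The minimal context and the conformance test can be sketched in Python as follows. This is a minimal sketch under an assumed representation: a file item is a (period, type, task) tuple, a context is a tuple of the same length, and `None` stands for φ. The function names are hypothetical.

```python
def minimal_context(items):
    """Slot by slot, keep a value only if every file item shares it; otherwise None (φ)."""
    context = []
    for k in range(3):
        values = {item[k] for item in items}
        context.append(values.pop() if len(values) == 1 else None)
    return tuple(context)

def conforms(item, context):
    """An item conforms when every slot of the context is φ or equals the item's value."""
    return all(c is None or c == v for v, c in zip(item, context))

def conforming_items(context, all_items):
    """I(X): the set of all file items that conform to context X."""
    return {item for item in all_items if conforms(item, context)}

D = [("H14", "plan", "project 1"), ("H14", "memo", "project 1")]
X = minimal_context(D)
```

With this representation, a context that specifies more attribute values always selects a subset of the items selected by a context that specifies fewer, which is the upper/lower relation between contexts.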
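[0048]
Contexts defined in this way form a hierarchy based on the inclusion relations between the sets of file items that conform to them. In this embodiment, the set of all file items conforming to context X is written I(X). I(X) is the set of all file items having the attribute values specified by X. Context X_1 is an upper context of context X_2 when the following holds.
[0049]
[Equation 7]
I(X_1) ⊃ I(X_2)
[0050]
Context X_1 is a lower context of context X_2 when the following holds.
[0051]
[Equation 8]
I(X_1) ⊂ I(X_2)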
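[0052]
The context generation means 4 of this embodiment generates, for example: contexts specifying only a "period" value (as many as there are "period" values); contexts specifying only a "type" value (as many as there are "type" values); contexts specifying only a "task" value (as many as there are "task" values); contexts specifying both a "period" and a "type" value (as many as there are combinations of "period" and "type" values); contexts specifying both a "period" and a "task" value (as many as there are combinations of "period" and "task" values); contexts specifying both a "type" and a "task" value (as many as there are combinations of "type" and "task" values); and the context whose attribute values are all empty sets. However, the context generation means 4 of this embodiment omits contexts to which no file item conforms, because in that case the number of transactions to be mined by the rule extraction means 5 is zero and no rules can be extracted. In addition, when the sets of file items conforming to several contexts coincide exactly, the rules mined by the rule extraction means 5 would be identical, so the context generation means 4 of this embodiment selects only one of those contexts.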
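The context generation just described can be sketched in Python as follows. This is an illustrative sketch with assumptions: contexts are tuples with `None` for φ, attribute values are taken from the observed file items (values with no conforming item would be dropped anyway), and when several contexts match exactly the same items the first one generated is kept. The text only says that one of them is selected, so that tie-breaking order is an assumption.

```python
from itertools import combinations, product

def generate_contexts(all_items, n_attrs=3):
    """Enumerate contexts with 0, 1, or 2 specified attributes; drop empty and duplicate ones."""
    domains = [sorted({item[k] for item in all_items}) for k in range(n_attrs)]
    kept = {}  # maps the set of conforming items to the first context that produced it
    for r in range(n_attrs):  # r = number of specified attributes
        for slots in combinations(range(n_attrs), r):
            for values in product(*(domains[k] for k in slots)):
                context = [None] * n_attrs
                for k, v in zip(slots, values):
                    context[k] = v
                matched = frozenset(
                    item for item in all_items
                    if all(c is None or c == v for v, c in zip(item, context))
                )
                if matched and matched not in kept:
                    kept[matched] = tuple(context)
    return list(kept.values())

items = {
    ("H13", "budget", "theme 1"),
    ("H13", "budget", "other"),
    ("H12", "plan", "other"),
}
contexts = generate_contexts(items)
```

Because contexts with fewer specified attributes are generated first, a two-attribute context that matches the same items as a one-attribute context is the one that gets dropped.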
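[0053]
An association rule in this embodiment is a pattern of the form "B occurred when A occurred", expressed as a rule A→B based on the co-occurrence frequency of A and B. Strictly, an association rule is defined as follows. Given file items a_i and the set I of all file items, a transaction t_j ⊂ I is defined by the following equation.
[0054]
[Equation 9]
t_j = {a_j1, ..., a_jn}
[0055]
Let T be the set of all transactions. If file items a_1, ..., a_n are all contained together in each of the transactions t_1, ..., t_m ∈ T and m is sufficiently large, those file items can be said to have a strong tendency to be used together. An association rule R expresses this tendency and is written as follows.
[0056]
[Equation 10]
R: a_1, ..., a_(n-1) → a_n
[0057]
R states that when a_1, ..., a_(n-1) are contained in a transaction, a_n is also contained in the same transaction; in other words, if a_1, ..., a_(n-1) are being used, a_n will be used as well.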
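[0058]
When there are many file items or transactions, an enormous number of association rules can be extracted, because even a combination that occurs only once can become a rule. To use association rules efficiently, useless rules must be removed by some measure to keep the number of rules down. Two commonly used measures are support and confidence.
[0059]
Support is the probability that the association rule holds for an arbitrarily chosen transaction. Given a set of file items A (⊂ I), let N(A) be the number of transactions that contain every file item in A, and let N be the total number of transactions. If b is a file item and R: A→b is an association rule, the support sup(R) of R is defined by the following equation.
[0060]
[Equation 11]
sup(R) = N(A∪{b}) / N
[0061]
Confidence, on the other hand, is the probability that rule R holds when it is applicable. That is, it is the conditional probability that b is used given that all the file items in the set A have been used. The confidence conf(R) of rule R is defined by the following equation.
[0062]
[Equation 12]
conf(R) = N(A∪{b}) / N(A)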
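The two measures can be computed directly from their definitions, as in the following Python sketch. The function names and the toy transactions are assumptions made for illustration; only the formulas sup(R) = N(A∪{b})/N and conf(R) = N(A∪{b})/N(A) come from the text.

```python
def count(transactions, items):
    """N(A): number of transactions that contain every file item in A."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def support(transactions, antecedent, consequent):
    """sup(R) for R: A -> b."""
    return count(transactions, set(antecedent) | {consequent}) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """conf(R) for R: A -> b; assumes A occurs in at least one transaction."""
    return (count(transactions, set(antecedent) | {consequent})
            / count(transactions, antecedent))

# Hypothetical transactions of file item numbers.
T = [{"01", "02"}, {"01", "02", "03"}, {"01"}, {"03"}]
sup_r = support(T, {"01"}, "02")      # 2 of the 4 transactions contain both 01 and 02
conf_r = confidence(T, {"01"}, "02")  # 2 of the 3 transactions with 01 also contain 02
```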
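[0063]
Evaluating association rules by support and confidence makes it possible to pick out the rules most likely to be useful. Apriori is a representative algorithm for extracting association rules quickly, given the minimum support and minimum confidence the rules must satisfy (reference: R. Agrawal and R. Srikant: Fast Algorithms for Mining Association Rules, In Proc. of 20th International Conference on Very Large Databases, 1994.). Apriori enumerates sets of file items that could become association rules as rule candidates, computes the support of each candidate, and keeps only the candidates that meet the minimum support. It narrows down the candidates using the property sup(A∪B) ≥ sup(A∪B∪C). This property means that a rule of length k+1 contains rules of length k. Rule candidates of length k+1 are therefore generated from the rules of length k that meet the minimum support, and each candidate is tested against the minimum support, which allows association rules to be extracted efficiently. Varying the minimum support and minimum confidence given to Apriori trades off the risk of missing useful rules against the number of rules and the computation time.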
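The level-wise candidate generation that the paragraph attributes to Apriori can be sketched in Python as follows. This is a simplified illustration of the frequent item set phase only (deriving rules and checking confidence are omitted); it is not the apriori.exe program used in the embodiment, and the function name and sample data are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every item set meeting the minimum support."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)

    def support(itemset):
        return sum(1 for t in tx if itemset <= t) / n

    # Level 1: single items that meet the minimum support.
    level = {frozenset([i]) for t in tx for i in t}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join frequent sets of size k-1 into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent, since
        # sup(A∪B) >= sup(A∪B∪C); then test the candidate's own support.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent

T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_itemsets(T, 0.5)
```

Lowering `min_support` lets rarer item sets through at the cost of more candidates per level, which is the trade-off the paragraph describes.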
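[0064]
An association rule R: A→b under context X should be a rule that, given the file items being used in the work approximated by context X, predicts the file item needed next. In the rest of this embodiment, association rules are also called prediction rules.
[0065]
In this embodiment, the set of all prediction rules satisfying {R = A→b; A ⊂ I(X)}, that is, all prediction rules whose antecedent A conforms to context X, is called the prediction rule set R(X) of context X. With a stricter condition, the set of rules in which not only the antecedent A but also the consequent b conforms to context X is called the strict prediction rule set R_s(X). Because the rule set R(X) contains the strict rule set R_s(X), the predictable range of R(X) is larger. For example, given the context X_1 = ⟨φ, {plan}, φ⟩ and the rule R_1: (H14, plan, other) → (H13, plan, other), R_1 belongs to both the rule set R(X_1) and the strict rule set R_s(X_1). The rule R_2: (H14, plan, other) → (H14, memo, other), on the other hand, belongs to R(X_1) but not to R_s(X_1). Below, when R(X) is contrasted with R_s(X), R(X) is called the normal rule set to avoid confusion.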
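[0066]
If N(X) is extended to mean the number of transactions containing any file item a ∈ I(X) conforming to context X, the support sup(R; X) of a rule R ∈ R(X) can be defined as follows.
[0067]
[Equation 13]
sup(R; X) = N(R) / N(X)
[0068]
sup(R; X) is the probability that rule R holds while work that may correspond to context X is being performed. For the context X = ⟨φ, φ, φ⟩, I(X) is the set of all file items and N(X) equals the total number of transactions N, so sup(R; X) = sup(R).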
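[0069]
Given a minimum support and a minimum confidence, the rule set R(X) of context X can be enumerated by, for example, the process shown in Fig. 3. First, transactions that contain no file item conforming to context X are removed from the mining target (step 201). For example, when I(X) = {01, 02, 04}, the transaction t_1 = {03, 04, 06} is not removed, but the transaction t_2 = {03, 06, 07} is. Next, prediction rules are extracted with the Apriori algorithm (step 202). Then, rules whose antecedent does not conform to context X are removed from the extracted prediction rule set (step 203). For example, when I(X) = {01, 02, 04}, the rule R_1: 04→06 is not removed, but the rule R_2: 03→06 is.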
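The three steps of Fig. 3 can be sketched in Python as follows. Note the substitution: to keep the example short and self-contained, step 202 uses a brute-force rule enumerator (`mine_rules`, a hypothetical helper) in place of the Apriori algorithm. Support is computed relative to the transactions kept in step 201, which corresponds to sup(R; X) = N(R)/N(X). All names and the third sample transaction are assumptions.

```python
from itertools import combinations

def mine_rules(transactions, min_sup, min_conf, max_len=3):
    """Brute-force stand-in for Apriori: enumerate rules A -> b over small item sets."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    items = sorted(set().union(*tx))

    def count(itemset):
        return sum(1 for t in tx if itemset <= t)

    rules = []
    for size in range(2, max_len + 1):
        for combo in combinations(items, size):
            c = count(frozenset(combo))
            if c == 0 or c / n < min_sup:
                continue
            for b in combo:
                a = frozenset(combo) - {b}
                if c / count(a) >= min_conf:
                    rules.append((a, b))
    return rules

def rules_for_context(transactions, context_items, min_sup, min_conf):
    ix = frozenset(context_items)                          # I(X)
    kept = [t for t in transactions if ix & set(t)]        # step 201
    rules = mine_rules(kept, min_sup, min_conf)            # step 202
    return [(a, b) for a, b in rules if a <= ix]           # step 203

T = [{"03", "04", "06"}, {"03", "06", "07"}, {"04", "06"}]
rules = rules_for_context(T, {"01", "02", "04"}, 0.5, 0.5)
```

As in the example in the text, the rule 04→06 survives step 203 while 03→06 is removed because its antecedent lies outside I(X).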
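[0070]
The process for obtaining the rule set R(X) is not limited to the example in Fig. 3. Because the time required for mining is exponential in the transaction length and does not depend on the minimum confidence, the process shown in Fig. 4 extracts rules in less time when transactions are sufficiently long. In this process, the file items that do not conform to context X are removed from each transaction (step 301). If the length of a transaction becomes 0 (step 302; Yes), the transaction is removed (step 303). For example, when I(X) = {01, 02, 04}, the transaction t_1 = {03, 04, 06} becomes {04}, and the transaction t_2 = {03, 06, 07} is removed from the mining target. Next, prediction rules are extracted with the Apriori algorithm with the minimum confidence set to 0 (step 304). The extracted prediction rule set is R_s(X) satisfying the minimum support. From each extracted rule R: A→b, the rule candidates given by the following expression are generated and tested against the predetermined minimum support and minimum confidence, and the candidates that satisfy both are added to the prediction rule set (step 305).
[0071]
[Equation 14]
Rule candidates {R_c: A∪{b} → c | c is not contained in I(X)}
[0072]
The process in Fig. 4 specializes the rule candidate generation step inside Apriori and adds it as post-processing; using the property sup(R; X) ≥ sup(R_c; X), it can extract the same rule set as the process in Fig. 3.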
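[0073]
The rule extraction means 5 of this embodiment extracts the rule set R(X) by, for example, the process shown in Fig. 4. The extracted rule set R(X) is stored in, for example, a rule database 9. Effective rules based on the work histories of multiple users are thereby accumulated in the rule database 9, and each user's work know-how can be used as a shared resource. In this embodiment, for example, the prediction rule extraction with the Apriori algorithm (steps 202 and 304) uses apriori.exe, a freeware program under the GNU license that implements the Apriori algorithm. apriori.exe is available from http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html. apriori.exe can be switched by an option between the usual definition of support and its own definition; this embodiment uses, for example, the usual definition. The process in Fig. 4 may also be stopped at step 304 to extract the strict rule set R_s(X).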
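[0074]
The work situation estimation means 6 selects one or more file items for prediction from the group of file items and the group of contexts, based on the target user's file usage status recorded by the file usage history management means 2. In this embodiment, the target user's file usage status means, for example, the file usage history specific to the target user at present, or at present and in the recent past (for example, a predetermined period of the past several hours to several days during which the user is presumed to have been doing the same work). The one or more file items for prediction are selected based on, for example, the attribute values of the files appearing in the target user's file usage status. The attribute values of the files are determined by, for example, the same process as described with reference to Fig. 2.
[0075]
The question is then how to select the one or more contexts for prediction, that is, which context's prediction rule set should be applied in predicting the files the user needs. A large prediction rule set widens the predictable range and misses less of the information to be provided, but it also increases information the user does not need and hinders appropriate presentation. A prediction rule set that is too small, on the other hand, may fail to provide the user with needed information.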
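[0076]
Let A be the set of file items the user has used at present or in the recent past. The work being performed can then be regarded as approximated by a context to which A conforms. In general, there may be several such contexts, because if A conforms to a context it also conforms to that context's upper contexts. However, the larger the file item set I(X) of a context X, the more information beyond A is being assumed, and the more information unneeded by the user may be included.
[0077]
The work situation estimation means 6 of this embodiment therefore selects the approximation that assumes as little information beyond A as possible, namely the minimal context containing A. The minimal context containing A can be selected either as a single context or as multiple contexts.
[0078]
When a single context is selected, the context X_k with the smallest file item set, that is, the smallest number of file items |I(X_k)|, is chosen from among the contexts {X_i} satisfying A ⊂ I(X_i). This corresponds to the case where A relates to a single task.
[0079]
When multiple contexts are selected, the combination whose combined file item set is smallest, that is, whose number of file items |∪_i I(X_i)| is smallest, is chosen from among the combinations of contexts satisfying A ⊂ I(X_i). This corresponds to the case where several tasks, one per context, are being performed in parallel. Only one context may end up being selected. If several combinations of contexts minimize |∪_i I(X_i)|, the combination whose contexts have a high co-occurrence probability is selected, for example. This makes it possible to select combinations of tasks that are often performed together.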
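Both selection strategies can be sketched in Python as follows. The sketch rests on assumptions that go beyond the text: contexts are given as a hypothetical mapping from a context name to its file item set I(X); for the multi-context case the combination is required to cover A with the union of its item sets (one reading of the condition in the text); and ties are resolved by keeping the first minimum found rather than by co-occurrence probability.

```python
from itertools import combinations

def select_single(used_items, contexts):
    """Pick the context X_k with A ⊂ I(X_k) and the smallest |I(X_k)|."""
    a = set(used_items)
    fitting = [x for x, ix in contexts.items() if a <= ix]
    return min(fitting, key=lambda x: len(contexts[x]), default=None)

def select_combination(used_items, contexts, max_contexts=2):
    """Pick the combination whose union covers A with the smallest |∪ I(X_i)|."""
    a = set(used_items)
    best, best_size = None, None
    for r in range(1, max_contexts + 1):
        for combo in combinations(contexts, r):
            union = set().union(*(contexts[x] for x in combo))
            if a <= union and (best_size is None or len(union) < best_size):
                best, best_size = combo, len(union)
    return best

# Hypothetical contexts and the file item numbers that conform to them.
contexts = {
    "X_plan": {1, 2, 3, 4},
    "X_H13": {3, 4, 5},
    "X_budget": {5, 6},
    "X_all": {1, 2, 3, 4, 5, 6, 7},
}
single = select_single({3, 4}, contexts)
combo = select_combination({4, 6}, contexts)
```

Here the items {4, 6} fit no single context except the all-φ one, but two narrower contexts cover them with a union of only four file items, which illustrates why a combination can approximate parallel work more tightly.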
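[0080]
In general, if each context approximates a task well, selecting multiple contexts approximates the user's current work more accurately and should improve prediction accuracy. The processing of the work situation estimation means 6 is not necessarily limited to the above example; for instance, the user may be asked to select attribute values that seem relevant, and contexts may be selected on that basis.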
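[0081]
Once the work situation estimation means 6 has selected the file items for prediction and the contexts for prediction, the information providing means 7 can retrieve the association rules corresponding to those contexts from the rule database 9 and apply them to identify the predicted file items correlated with the file items for prediction. The information providing means 7 then makes the files belonging to the identified predicted file items easily accessible to the user. The way of making files accessible is not particularly limited; for example, the files themselves may be sent to the user terminal 8, or an HTML file containing hyperlinks to the files may be sent to the user terminal 8. The information providing means 7 needs to know where the files are located. To that end, the file itemization means 3 may, for example, manage the location of each file in association with its file item. Well-known file search techniques for intranets, shared file collections, document databases, and the like may also be used.
[0082]
Because several association rules may correspond to a context, several predicted file items may be identified. If the number of files predicted to be needed becomes very large, the files presented may be narrowed down using the user's affiliation, interests, and so on. The predicted file items may also be ranked in descending order of the probability that the applied association rule holds (support or confidence), and only a predetermined number of them selected from the top. Further, the files may be organized for presentation so that they are easy to view; for example, they may be arranged hierarchically using the attribute values of the file items, files similar to those currently in use may be grouped together, or they may be ranked by prediction accuracy or the like. Organizing the information in this way lets the user easily pick out what is needed. Other methods may be adopted as appropriate, such as classifying the information broadly and narrowing it down according to the user's choices, or first presenting representative examples and then gradually widening the range of information provided.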
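[0083]
[Example]
An experiment was conducted to verify the effectiveness of the required file prediction method and system described above. In the experiment, the file usage history management means 2 recorded one user's file accesses over 49 days. The attribute values given in advance for file itemization were set as follows, taking into account the work done during that period: the "period" values were {H12, H13}, the "type" values were {memo, plan, budget, hearing, program, data}, and the "task" values were {theme 1, EMS, other}. The main tasks the user performed during the experiment were updating the home page of the environmental management system (EMS), writing programs, preparing materials for a research hearing, preparing the breakdown of next fiscal year's budget for "theme 1" (the research theme "Elemental technologies for next-generation intelligent information systems"), and writing research memos. During the experiment, the user most often wrote memos and programs related to "theme 1". In this experiment, all research themes from the previous fiscal year or earlier were classified as "other".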
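[0084]
The user used 36 files during the experiment, which were itemized into 17 file items. Table 1 shows the file items that appeared in the file usage history. Each row of Table 1 represents a file item: the first column gives the serial number identifying the file item, the second column its attribute values, and the third column the number of files consolidated into it.
[0085]
[Table 1]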

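[0086]
Table 2 shows part of the file usage history itemized by the file itemization means 3. Each row of Table 2 represents one day's work: the second column is the sequence of file items in order of use, and the third column is the set of file items obtained by removing duplicates and sorting, for use in extracting prediction rules. The rule extraction means 5 uses the file item sets in the third column for mining.
[0087]
[Table 2]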

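[0088]
The attribute values allow 54 contexts in total, but only 23 were actually generated, because contexts to which no file item conforms and contexts whose conforming file items coincide with those of another context were omitted. For example, the context ⟨{H12}, {HTML}, φ⟩ was omitted because no file item in this experiment had both the values "H12" and "HTML". Likewise, because the file items conforming to the context ⟨{H13}, {budget}, φ⟩ and to the context ⟨φ, {budget}, φ⟩ coincided ({14, 16}) in this experiment, the former was omitted.
[0089]
Treating one day's usage record as one transaction, the rule extraction means 5 extracted prediction rule sets with a minimum support of 10% and a minimum confidence of 50%. Both the normal prediction rule set R(X) and the strict prediction rule set R_s(X) were extracted. Table 3 shows the results.
[0090]
[Table 3]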

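[0091]
Each row of Table 3 corresponds to a context X, and the columns have the following meanings. The first column gives the set I(X) of file items conforming to context X. The second column gives the number N(X) of transactions containing a file item conforming to context X. The third column gives the number of rules: the size |R_s(X)| of the strict prediction rule set and the size |R(X)| of the normal prediction rule set. The fourth column gives the number of rules that could not be extracted in the upper contexts and were first extracted by mining in that context. For example, the new rules for X = ⟨φ, {plan}, φ⟩ are the rules mined under X that were not mined under ⟨φ, φ, φ⟩. In the notation a/b/c, a is the number of new rules relative to ⟨φ, φ, φ⟩, b the number relative to the upper context having the first non-φ attribute, and c the number relative to the upper context having the second non-φ attribute. For example, with the strict rule set of the context ⟨{H12}, {plan}, φ⟩, two new rules were extracted relative to ⟨φ, φ, φ⟩, one relative to ⟨{H12}, φ, φ⟩, and none relative to ⟨φ, {plan}, φ⟩. Table 4 shows examples of the extracted rules. For readability, Table 4 factors out into the second column the attribute values common to all file items in a rule.
[0092]
[Table 4]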

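[0093]
The rule set of the context X_0 = ⟨φ, φ, φ⟩ is the rule set obtained by applying Apriori without introducing contexts. The rule sets of X_0 and of each context are therefore compared. As Table 3 shows, the number of rules is smaller in each context, which indicates that only rules specific to the context are extracted. The extraction of new rules also shows that introducing contexts yields, in each context, prediction rules for infrequent tasks. Comparing X_0 with each individual context, the predictable range of the rule set is larger for X_0 in most cases. Comparing the predictable range of the prediction rule sets as a whole, however, the range is larger when contexts are introduced. Table 5 shows this.
[0094]
[Table 5]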

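[0095]
The rows of Table 5 represent three cases: prediction with a rule set that does not use contexts (no context), prediction with contexts using all the normal prediction rule sets (context (normal)), and prediction with contexts using all the strict prediction rule sets (context (strict)). The predictable file items in the second column are the file items that can be predicted by some prediction rule. The number of unpredictable file items in the third column is the number of file items that can never be predicted in that case. The total number of file items is 17. As Table 5 shows, the "no context" case has as many as 10 unpredictable file items. This also indicates that, without contexts, prediction for infrequent tasks fails or is unlikely to succeed. In the cases with contexts, by contrast, the number of unpredictable file items is as small as 4 or 3, so appropriate prediction may be possible even for infrequent tasks. Introducing contexts can thus be confirmed to enable appropriate prediction even for infrequent tasks.
[0096]
Next, the effect on prediction performance is evaluated for single versus multiple context selection and for the normal versus the strict prediction rule set.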
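[0097]
The file item sequence A used for prediction, corresponding to the files the user is using, is extracted from the file item sequence of a transaction (second column of Table 2). For example, from the file item sequence "01 02 03 04 05 03 04 04" in the first row of Table 2, the sequence of the 5th to 7th file items, "05 03 04", is extracted. If A were chosen simply by uniform random numbers, A would be likely to consist only of frequently used file items, making it difficult to evaluate prediction performance for infrequent tasks. Therefore, for each of the 17 file items, one A that always contains that file item was extracted at random. For example, for file item 12, one of the file item sequences containing 12 ("08 12 01", "01 12 01", and so on) is extracted at random. The length of A was 2 or 3, decided by a random number. When elements were duplicated (for example, "01 12 01"), the later duplicate was removed from A (giving, for example, "01 12").
[0098]
The file items to be predicted were the set of file items obtained by removing the file item sequence A used for prediction from the original file item sequence. For example, given the sequence "01 02 03 04 03 01" and A = {03, 04}, the file items to be predicted are {01, 02, 03}. Duplicate file items were omitted.
[0099]
Prediction was then performed for the 17 file item sequences A using five rule sets: (1) the rule set obtained by simply applying Apriori without contexts (no context); (2) the strict rule set when the single minimal context containing A is selected (1 context (strict)); (3) the normal rule set when the single minimal context containing A is selected (1 context (normal)); (4) the strict rule set when the minimal combination of two contexts containing A is selected (2 contexts (strict)); and (5) the normal rule set when the minimal combination of two contexts containing A is selected (2 contexts (normal)). Rule sets with a minimum support of 10% and a minimum confidence of 50% were used for prediction.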
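[0100]
The prediction results were evaluated with recall and precision, which are used to evaluate search results in information retrieval, and with the F-measure, an overall criterion. Recall evaluates how few predictions are missed, and precision evaluates how few unneeded file items the prediction results contain. Letting N be the number of file items that should have been predicted and actually were predicted, recall, precision, and the F-measure are defined by the following equations.
[0101]
[Equation 15]
Recall = N / (number of file items that should be predicted)
[0102]
[Equation 16]
Precision = N / (number of file items predicted)
[0103]
[Equation 17]
F-measure = 2 · recall · precision / (recall + precision)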
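The three measures can be computed as in the following Python sketch. The function name and the sample item sets are assumptions for illustration; returning 0.0 when a denominator would be zero is also a convention chosen here, not something the text specifies.

```python
def evaluate(expected, predicted):
    """Return (recall, precision, F-measure) for one prediction."""
    expected, predicted = set(expected), set(predicted)
    hits = len(expected & predicted)  # N: expected items that were actually predicted
    recall = hits / len(expected) if expected else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if hits else 0.0
    return recall, precision, f_measure

# Hypothetical case: 2 of the 3 expected items appear among 4 predicted items.
recall, precision, f_measure = evaluate({"01", "02", "03"}, {"01", "03", "05", "07"})
```

In this hypothetical case recall is 2/3 and precision is 1/2, so the F-measure is 4/7; widening the prediction can only raise recall while it tends to lower precision.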
[0104]
Table 6 shows the evaluation of prediction with each rule set.
[0105]
[Table 6]

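[0106]
In the table, the symbols 〇, △, and × indicate the best, second-best, and worst values, respectively. As Table 6 shows, there is a trade-off between recall and precision. The strict rule sets are better than the normal rule sets in precision and worse in recall. Compared on the F-measure, the overall criterion, 2 contexts (normal) and 1 context (strict) give better results than the no-context case.
[0107]
The prediction results of "2 contexts (normal)", which has high recall, contain most of the file items needed for the work. In this example, therefore, an attempt was made to improve the precision of the information provided by ranking the predicted file items by rule confidence.
[0108]
The confidence of a file item is defined as the highest confidence among the rules that predicted it. For example, suppose rules R_1 and R_2, with confidences of 50% and 80% respectively, both predict file item b. If both R_1 and R_2 are used in predicting b, the confidence of b is 80%. Table 7 lists the predicted file items in order of confidence for the case of 2 contexts (normal).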
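The confidence of a predicted file item and the ranking by confidence can be sketched in Python as follows. The rule representation (antecedent set, consequent, confidence), the function name, and the sample rules are assumptions for illustration; ties are broken by item name only to make the order deterministic.

```python
def rank_predictions(rules, used_items, top_n=None):
    """Apply the rules whose antecedent is in use; score each item by its best rule."""
    used = set(used_items)
    best = {}
    for antecedent, consequent, conf in rules:
        if set(antecedent) <= used:
            # The confidence of a file item is the highest confidence among its rules.
            best[consequent] = max(conf, best.get(consequent, 0.0))
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top_n is None else ranked[:top_n]

# Hypothetical rules: R_1 and R_2 both predict b, with confidences 0.5 and 0.8.
rules = [
    ({"a"}, "b", 0.5),
    ({"c"}, "b", 0.8),
    ({"a"}, "d", 0.6),
    ({"z"}, "e", 0.9),  # not applicable: z is not in use
]
ranked = rank_predictions(rules, {"a", "c"})
top1 = rank_predictions(rules, {"a", "c"}, top_n=1)
```

Cutting the ranked list at `top_n` is the knob that trades unneeded file items against missed ones.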
[0109]
[Table 7]

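[0110]
Each row of Table 7 shows, for one of the 17 given file item sequences, the file items that should be predicted and the file items that were predicted. The first column gives the file item sequence supplied as the file items in use in the user's current work. The second column gives the file items that should be predicted from the given sequence. The set of file items appearing in the first and second columns constitutes the original transaction. The third column gives the file items actually predicted and their confidence (in X/YY.Y, X is the file item and YY.Y is its confidence). The predicted file items are listed in descending order of confidence; those in bold are file items that should have been predicted, and those in italics are file items that should not have been predicted but are contained in the file item sequence given for prediction.
[0111]
As Table 7 shows, among the predicted file items, those with high confidence tend to be the ones that should have been predicted. When precision was computed using only the top n predicted file items, it was 0.72 for n = 2 and 0.63 for n = 3, exceeding the precision of 0.52 obtained without considering confidence. It also exceeded the precision of 0.54 obtained without contexts.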
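[0112]
One cause of degraded precision is that file items in the sequence given for prediction (for example, "07 08 10" in row 8, column 1 of Table 7) appear among the predicted file items (the italic entries in Table 7; for example, 07/100.0 and 08/100.0 in row 8, column 3). This happens because other file items in the given sequence are predicted from part of that sequence. In row 8 of Table 7, for example, 07 and 08 are frequently used together, so the rules 07→08 and 08→07 are applied. These file items represent files of the same kind as the files currently in use. If such files are classified as files similar to those in use and made available to the user on demand, they do not get in the way of information provision and are in many cases actually useful. The essentially unneeded file items that hinder appropriate information provision are therefore considered to be the predicted file items in Table 7 that are neither bold nor italic.
[0113]
Fig. 5 shows, when the top n predicted file items by confidence are selected, the number of unneeded file items among them and the number of file items that should be in the prediction results but are not (missed file items). Fig. 5 shows the following. First, reducing n improves precision but increases misses. For example, with n = 2, only 0.18 unneeded file items are included on average, but the number of missed file items averages 1.18. Second, the numbers of unneeded and missed file items balance between n = 3 and n = 4. These results suggest that, when providing information to the user, starting from n = 3 or 4 and gradually widening or narrowing the range makes it possible to control the quality and quantity of information to suit the work situation, enabling effective information provision.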
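[0114]
The results of this example thus confirm that: (a) the rule extraction method of the present invention increases the number of predictable file items compared with ordinary data mining without contexts; (b) the context selection method of the present invention improves prediction performance; (c) selecting the n file items with the highest confidence from the prediction results improves precision; and (d) increasing or decreasing n controls the trade-off between the quality and quantity of the information provided. The present invention was thus confirmed to be highly effective for predicting and providing the information needed for work.
[0115]
As described above, according to the present invention, the use of multiple files is replaced by the use of a single file item, so the apparent usage frequency per file item in the file usage history increases, making it possible to extract abstracted rules with generality. Because file items that consolidate files with common properties are the objects of data mining, the time required for mining is shorter than when individual files are mined. Further, because a newly created file can be mapped to an existing file item, new files are prevented from falling outside the scope of prediction. Providing the files needed for work to the user categorized by their file items also makes it easier for the user to pick out information.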
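[0116]
The kinds of files used in any given task can be approximated by an appropriately selected context or combination of contexts. In that sense, a context reflects the user's work situation. Selecting contexts estimates the range of file items needed for the work the user is currently performing. Limiting the prediction range makes it possible to improve the performance of predicting the file item needed next.
[0117]
Further, introducing contexts partitions the transactions according to the attribute values of the file items. This prevents the extraction of harmful association rules that degrade prediction accuracy because of parallel work. Suppose, for example, that files A, B, C, and D are used during the meeting-material task P_1, that files X and Y are used during the project-planning task P_2, and that the files are used in the time order A→B→X→Y→C→D. With contexts, X and Y, which have the attribute value "plan", for example, and the file items A, B, C, and D, which have the attribute value "meeting material", for example, can be treated as being used in separate tasks. If mining is restricted to the transactions containing the file items corresponding to each set of files, prediction rules for P_1 and for P_2 can be extracted separately, and the rules are more likely to meet the minimum support.
[0118]
Further, for infrequent tasks, the file usage patterns specific to the task appear rarely, so with ordinary mining the prediction rules for such tasks are likely to be screened out. (For example, for a report prepared every quarter (about 90 days), a rule specific to that task may have a support of only about 1/90 and may be ignored because its support is too low.) Extracting prediction rules per context, however, raises the frequency relative to the set of file items specified by the context. That is, rules that would be ignored for low support across all file items, but that are sufficiently valid within a limited context, can be extracted. Rules specific to a task can therefore be extracted even for infrequent tasks. The effort of extracting rules that are unnecessary in the context is also saved.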
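[0119]
The embodiment described above is one preferred example of the present invention, but the invention is not limited to it, and various modifications are possible without departing from the gist of the invention. For example, when selecting the file items and contexts for prediction based on the target user's file usage status, or when providing the user with information on the predicted files, each user's profile (for example, department, collaborators, and so on) may be used to refer to the file usage patterns and prediction rules of other users with similar profiles, so that information presumed to be unneeded is omitted. The method of the present invention also need not be implemented as a system (apparatus); in that case, the files need not be electronic data and may, in some cases, be information on paper.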
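[0120]
[Effects of the invention]
As is clear from the above description, the required file prediction method according to claim 1, the required file prediction system according to claim 7, and the required file prediction program according to claim 8 make it possible to put to effective use the files a user created in the past and the files other users created in the past, and to share work know-how as a resource.
[0121]
Further, according to the present invention, the use of multiple files is replaced by the use of a single file item, so the apparent usage frequency per file item in the file usage history increases, making it possible to extract abstracted rules with generality. Because file items are the objects of data mining, the time required for mining is shorter than when individual files are mined. Because a newly created file can be mapped to an existing file item, new files can also be prevented from falling outside the scope of prediction. The files needed for work can also be provided to the user categorized by their file items, which makes it easier for the user to pick out information.
[0122]
Further, introducing contexts makes it possible to extract prediction rules for each of the tasks a user performs in parallel. Rules that would be ignored for low support across all file items, but that are sufficiently valid within a limited context, can also be extracted. Rules specific to a task can therefore be extracted even for infrequent tasks.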
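[0123]
Further, according to the required file prediction methods of claims 2 and 3, given the file items being used in a task approximated by a context, rules that predict the file item needed next can be extracted.
[0124]
Further, according to the required file prediction methods of claims 4 and 5, the information assumed beyond the files the target user has recently used is minimized, and the provision of information the target user does not need is reduced.
[0125]
Further, according to the required file prediction method of claim 6, the provision of information the target user does not need is reduced, and adjusting the top-n range of predicted file items selected makes it possible to control the quality and quantity of information to suit the target user's work situation.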
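[Brief description of the drawings]
[Fig. 1] A block diagram showing the schematic configuration of one embodiment of the required file prediction system of the present invention.
[Fig. 2] A schematic flowchart showing an example of the process for determining the attribute values of files appearing in the file usage history.
[Fig. 3] A schematic flowchart showing an example of the process by which the rule extraction means extracts association rules.
[Fig. 4] A schematic flowchart showing another example of the process by which the rule extraction means extracts association rules.
[Fig. 5] A graph showing the trade-off between prediction precision and the number of misses when the predicted file items are ranked in descending order of the probability (confidence) of the applied association rule and the top n are selected.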
[Explanation of reference numerals]
1 Required file prediction system
2 File usage history management means
3 File itemization means
4 Context generation means
5 Rule extraction means
6 Work situation estimation means
7 Information providing means
[0001]
[Technical field of the invention]
The present invention relates to a method, a system, and a program for predicting the files that a target user needs. More specifically, it relates to a method, a system, and a program in which a data mining technique for extracting association rules is applied to the prediction of required files.
[0002]
[Prior art]
With the spread of personal computers and the Internet, and the spread of information systems using communication network technology within corporate organizations, the digitization of information such as documents, documents, and charts has rapidly progressed. For this reason, various kinds of information exist on computers inside and outside the organization, and information necessary for performing work can often be obtained via a network.
[0003]
As a conventional technology for assisting a user of information (hereinafter also simply referred to as a user) in obtaining information, there is the technology represented by search engines on the Internet. A search engine returns links to matching web pages when the user supplies, as keywords, the characteristics of the information needed. The results are refined by, for example, omitting duplicate pages and ranking pages in order of how likely the user is to need them.
[0004]
Web Usage Mining is a technique by which an information system or website administrator, as an information provider, grasps users' intentions and improves the site's mechanisms so as to enhance user convenience (reference: J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, No. 2, 2000). Web Usage Mining is a type of data mining that discovers correlation rules such as "a user who has seen page A has a strong tendency to see page B" from the access log of the pages in a site recorded on a web server. The results of Web Usage Mining are used to improve the site configuration, such as the page content and the link structure between pages. For example, when a pattern such as "a user who has viewed page A also views page B" appears frequently, it can be inferred that many users need the information on page A and the information on page B at the same time. Therefore, when page B cannot be reached directly by a link from page A, adding a link or appropriate guidance between A and B makes it easy for users to obtain the information they need.
[0005]
In addition, a system that autonomously collects information according to the user's preference (reference: O. Etzioni: Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web, In Proc. of AAAI-96, 1996) and a system that provides information to users based on collaborative filtering (reference: N. Good, J. B. Schafer, et al.: Combining Collaborative Filtering with Personal Agents for Better Recommendations, In Proc. of AAAI-99, 1999) have been proposed in the past.
[0006]
[Problems to be solved by the invention]
However, even if information useful for work exists on computers inside and outside the organization, few users are aware of where information they do not normally use is located. Therefore, when performing important but rare tasks (such as creating a semi-annual work report) or new tasks (such as new work following a transfer or creating materials for a new project), users often do not know where to obtain the necessary information.
[0007]
In addition, in information search using a search engine, it is difficult to select an appropriate keyword, and a large number of documents and figures that are likely to be related are presented, so that the user often gets lost in selecting information.
[0008]
Further, in a system that autonomously collects information according to a user's preference or a system that is based on collaborative filtering, irrelevant recommendations are often made because work conditions are not considered. In general, if the work situation is limited, it is considered that the type of information used in the work situation and the use pattern of the information can be specified to some extent. However, it is almost impossible to cover all possible work situations and determine in advance the information required for each work situation.
[0009]
Here, in order to predict the files required by a user, it is conceivable to apply Web Usage Mining to the file usage history and perform data mining that finds association rules such as "a user who uses file A has a strong tendency to use file B". However, this approach has the following problems.
[0010]
First, files are not reused very often in daily work. Even when the same work is performed, different files with similar contents may be used. Therefore, if individual files are simply treated as the data mining targets, it is highly likely that no association rules with generality will be extracted.
[0011]
Second, files such as new documents and charts are created on a daily basis, and in order to extract association rules that include such new files, it is necessary to wait until the new files are reused.
[0012]
Third, harmful correlation rules that degrade prediction accuracy may be extracted because of the parallelism of work. In daily work, several tasks are often performed in parallel. For example, consider a situation in which a user plans a project that will start next year while preparing materials for a meeting. Suppose that files A, B, C, and D are used in the meeting material creation work P₁ and that files X and Y are used in the new project planning work P₂. Suppose also that the files are used in the time sequence A → B → X → Y → C → D. The file usage history is then recorded in the order A → B → X → Y → C → D, and the prediction rule set extracted from it includes rules that span {A, B, C, D} and {X, Y} (for example, A → X). Such a rule is useful when the works P₁ and P₂ are actually performed in parallel. However, when only P₁ is performed, there is no need to predict X, and the rule is harmful in that it degrades prediction accuracy.
[0013]
Fourth, correlation rules for important but rarely occurring tasks (for example, quarterly reports, work progress reports at the end of the month, and planning and budgeting for the next year) are very unlikely to be extracted because their support is too small. This is true even if the task is a highly important one that always recurs at regular intervals.
[0014]
Therefore, an object of the present invention is to provide a method, a system, and a program for appropriately predicting a file required by a user.
[0015]
[Means for Solving the Problems]
In order to achieve the above object, the method for predicting a required file according to claim 1 defines in advance a plurality of attributes representing characteristics of a file and the attribute values that each attribute can take; for the files that appear in a file usage history recorded in the past, collects each group of files having the same attribute values into one file item, thereby converting the file usage history into file items; generates a plurality of contexts, each of which is a combination of attribute values; extracts, based on the itemized file usage history and in association with each context, correlation rules established between file items; selects, based on the file usage status of the target user, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; identifies the predicted file items that correlate with the file items for prediction by applying the correlation rules corresponding to the contexts for prediction; and predicts the files belonging to the predicted file items as the files needed by the target user.
[0016]
The system for predicting a required file according to claim 7 comprises: file usage history management means for recording a file usage history; file itemizing means for collecting, based on a plurality of attributes representing the characteristics of a file and the attribute values that each attribute can take, each group of files having the same attribute values among the files appearing in the file usage history into one file item, thereby converting the file usage history into file items; context generating means for generating a plurality of contexts, each of which is a combination of attribute values; rule extracting means for extracting, based on the itemized file usage history and in association with each context, correlation rules established between file items; work situation estimating means for selecting, based on the file usage status of the target user recorded by the file usage history management means, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and information providing means for applying the correlation rules corresponding to the contexts for prediction to identify the predicted file items correlated with the file items for prediction, and for allowing the target user to access the files belonging to the predicted file items.
[0017]
The program for predicting a required file according to claim 8 causes a computer to function as: file usage history management means for recording a file usage history; file itemizing means for collecting, based on a plurality of attributes representing the characteristics of a file and the attribute values that each attribute can take, each group of files having the same attribute values among the files appearing in the file usage history into one file item, thereby converting the file usage history into file items; context generating means for generating a plurality of contexts, each of which is a combination of attribute values; rule extracting means for extracting, based on the itemized file usage history and in association with each context, correlation rules established between file items; work situation estimating means for selecting, based on the file usage status of the target user recorded by the file usage history management means, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and information providing means for applying the correlation rules corresponding to the contexts for prediction to identify the predicted file items correlated with the file items for prediction, and for allowing the target user to access the files belonging to the predicted file items.
[0018]
Therefore, files having a common property are integrated into one file item. File items are classified by context, and association rules established between file items are extracted for each context. Then, a context that approximates the work situation is selected based on the work situation performed by the user, and file items required for the work are derived by the association rule. The user can easily access a file matching the file item, that is, a file presumed to be required by the user.
[0019]
According to a second aspect of the present invention, in the method of predicting a required file according to the first aspect, the association rule is extracted such that a file item belonging to the condition part belongs to a corresponding context. Therefore, when a file item used in the work approximated by the context is given, a rule for predicting the next required file item can be extracted.
[0020]
According to a third aspect of the present invention, in the method for predicting a required file according to the first aspect, the association rules are extracted such that both the file items belonging to the condition part and the file item belonging to the conclusion part belong to the corresponding context. In this case, the extracted rule set fits the work situation approximated by the context more strictly than in the invention described in claim 2, but its predictable range is narrower.
[0021]
According to a fourth aspect of the present invention, in the method for predicting a required file according to any one of the first to third aspects, one or more file items for prediction are identified based on the attribute values of the files appearing in the file usage status, and the single context is selected to which the file items for prediction belong and to which the smallest number of other file items belong. Therefore, as little as possible needs to be assumed beyond the files recently used by the target user, and the provision of unnecessary information to the target user can be reduced. Selecting a single context corresponds to the case where the target user is performing a single task.
[0022]
According to a fifth aspect of the present invention, in the method for predicting a required file according to any one of the first to third aspects, one or more file items for prediction are identified based on the attribute values of the files appearing in the file usage status, and the combination of contexts is selected to which the file items for prediction belong and to which the smallest number of other file items belong. Therefore, as little as possible needs to be assumed beyond the files recently used by the target user, and the provision of unnecessary information to the target user can be reduced. Selecting a plurality of contexts corresponds to the case where the target user is performing a plurality of tasks in parallel.
[0023]
According to a sixth aspect of the present invention, in the method for predicting a required file according to any one of the first to fifth aspects, the predicted file items are ranked in descending order of the probability that the applied correlation rule holds, and a predetermined number of predicted file items are selected from the top. Therefore, the provision of unnecessary information to the target user can be reduced, and by adjusting the range of the top n predicted file items that are selected, the quality and quantity of information can be controlled to suit the target user's work situation.
[0024]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the configuration of the present invention will be described in detail based on an embodiment shown in the drawings.
[0025]
FIGS. 1 to 4 show an embodiment of a method for predicting a required file according to the present invention. This method defines in advance a plurality of attributes representing the characteristics of a file and the attribute values that each attribute can take; for the files appearing in a file usage history recorded in the past, collects each group of files having the same attribute values into one file item, thereby converting the file usage history into file items; generates a plurality of contexts, each of which is a combination of attribute values; extracts, based on the itemized file usage history and in association with each context, correlation rules established between file items; selects, based on the file usage status of the target user, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; identifies the predicted file items that correlate with the file items for prediction by applying the correlation rules corresponding to the contexts for prediction; and predicts the files belonging to the predicted file items as the files needed by the target user.
[0026]
In the present embodiment, one or a plurality of known computers, each having, for example, a central processing unit (CPU), a main storage device, an external storage device, an output device such as a display, an input device such as a keyboard, and a communication interface with external information processing devices, are used as the required file prediction system. As shown in FIG. 1, the required file prediction system 1 includes: file usage history management means 2 for recording a file usage history; file itemizing means 3 for collecting, based on a plurality of attributes representing the characteristics of a file and the attribute values that each attribute can take, each group of files having the same attribute values among the files appearing in the file usage history into one file item, thereby converting the file usage history into file items; context generation means 4 for generating a plurality of contexts, each of which is a combination of attribute values; rule extraction means 5 for extracting, based on the itemized file usage history and in association with each context generated by the context generation means 4, correlation rules established between file items; work situation estimating means 6 for selecting, based on the file usage status of the target user recorded by the file usage history management means 2, one or more file items for prediction and one or more contexts for prediction from the group of file items and the group of contexts; and information providing means 7 for applying the correlation rules corresponding to the contexts for prediction to identify the predicted file items correlated with the file items for prediction, and for allowing the target user to access the files belonging to the predicted file items. The required file prediction program of the present invention is read and executed by a computer to make the computer function as the required file prediction system 1 described above. Reference numeral 8 in FIG. 1 indicates an information processing terminal (for example, a personal computer) owned by the user. FIG. 1 shows one user terminal 8, but the system 1 can be used by a plurality of users, and it goes without saying that a plurality of user terminals 8 may be provided.
[0027]
Here, the file in the present embodiment refers to a file as electronic data. Note that the type of file is not particularly limited, and may be any type of file available on a computer, such as a document, a table, an image, a program, and HTML. Since the file according to the present embodiment is electronic data that can be managed by a computer, an access log to the file can be automatically recorded.
[0028]
For example, in this embodiment, the program filespy included in the resource kit for Windows 2000 (a trademark of Microsoft Corporation) is used to make a computer function as the file usage history management means 2. filespy is a kind of IFS (Installable File System) filter driver that can monitor local and network drives and record I/O request packets. Because filespy also records the disk I/O generated when ordinary applications start and the disk I/O that the OS performs as needed, its output becomes very large. Therefore, in the present embodiment, the directories containing the files used by the user are specified, and only file accesses under those directories are filtered and recorded. However, the file usage history management means 2 is not limited to the example of the present embodiment. The file usage history can record file accesses from a large number of user terminals 8 connected via a network. To extract only the file accesses of a specific user from the file usage history, information that identifies the user or the user terminal 8, such as an IP address or a user ID, can be used.
[0029]
As attributes representing the characteristics of the file (in other words, the characteristics of the work related to the file), for example, in the present embodiment, three attributes “time”, “type”, and “work” are defined. However, the number of attributes is not limited to three, and the content of the attributes is not limited to the example of the present embodiment, and can be set arbitrarily according to the form of work of the user. The attribute in the present embodiment is a so-called container in which the attribute value is stored. The attribute is common to all the files, but the attribute value included in the attribute may be different for each file.
[0030]
Each attribute in the present embodiment has the following meaning. First, the attribute “time” indicates a time when the file is mainly used. Specific examples of the attribute value included in the attribute “time” include “H13” and “first half”. The attribute "type" indicates the role and use of the file in performing the work, the type of the file, and the like. Specific examples of attribute values included in the attribute “type” include, for example, “plan”, “result”, “memo”, “material”, “program”, and “report”. The attribute “work” represents a word that can specify the work performed by the user. Specific examples of the attribute value included in the attribute “work” include, for example, “research theme name”, “project name”, “code of product under development”, and the like.
[0031]
The file itemizing means 3 identifies the files having the same attribute values among the files appearing in the file usage history and treats each group of identified files as a file item. As a result, a group of files having common properties is integrated into one file item. Prediction of the needed information is performed on file items, and the files matching a predicted file item are provided to the user. Each file item in the present embodiment is represented by a tuple of attribute values. For example, the group of files with "time" = "H14" (Heisei 14), "type" = "plan", and "work" = "case 1" is integrated into the file item expressed as (H14, plan, case 1).
[0032]
Methods of determining the attribute values of a file include, for example, (1) finding keywords corresponding to predetermined attribute values in the file name or in the words appearing in the file, (2) clustering documents by meaning and using the words corresponding to each cluster as attribute values, and (3) classifying files with a decision tree to determine attribute values. Alternatively, a user or a system administrator may determine the attribute values of a file and input them to the computer serving as the file itemizing means 3.
[0033]
For example, in the present embodiment, method (1) is adopted, and the file itemizing means 3 determines the attribute values of the files appearing in the file usage history. An example of the attribute value determination processing according to the present embodiment will be described with reference to the flowchart shown in FIG. 2. First, the "type" attribute value of a file having a specific extension is determined (step 101). For example, a file with the extension ".c" or ".bat" has "type" = "program", a file with the extension ".dat" has "type" = "data", and a file with the extension ".html" has "type" = "HTML". Next, predetermined keywords are searched for in the file name, and the attribute values of "time", "type", and "work" are determined (step 102). For example, the file "2001 research plan.doc" has "time" = "H13" and "type" = "plan". If the attribute value of "type" has already been determined in step 101, it is not changed. Next, the file is converted to text, and the first ten lines of the file are matched against previously prepared keywords to determine the attribute values of "time", "type", and "work" (step 103). The first ten lines are used because the title, date, purpose, and the like are usually at the top of the file. Attribute values of "time", "type", and "work" that have already been determined by step 102 are not changed. In steps 102 and 103, it is preferable to unify synonyms and near-synonyms using a table or the like. For example, if any of the words "EMS", "environmental management system", and "environment" is included, "work" = "EMS", and if the word "plan" or a synonym of it is included, "type" = "plan". If the attribute "time" has not been determined by the above processing, "time" is determined from the time stamp (step 104). If the attribute "type" has not been determined, "type" = "memo" (step 105). If the attribute "work" has not been determined, "work" = "other" (step 106).
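The flow of steps 101 to 106 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the keyword tables `EXT_TYPE` and `KEYWORDS`, the helper names, and the simple substring matching are hypothetical and are not taken from the embodiment, which gives only a few example keywords.

```python
from datetime import datetime

# Hypothetical lookup tables; the embodiment only gives a few examples.
EXT_TYPE = {".c": "program", ".bat": "program", ".dat": "data", ".html": "HTML"}
KEYWORDS = {
    "time": {"2001": "H13", "H13": "H13", "2002": "H14", "H14": "H14"},
    "type": {"plan": "plan", "report": "report", "memo": "memo"},
    # Synonyms are unified by mapping several keywords to one attribute value.
    "work": {"EMS": "EMS", "environmental management system": "EMS"},
}


def match_keywords(text, attrs):
    """Fill in only the attributes that are still undetermined."""
    lowered = text.lower()
    for attr, table in KEYWORDS.items():
        if attrs[attr] is None:
            for keyword, value in table.items():
                if keyword.lower() in lowered:  # naive substring match (assumption)
                    attrs[attr] = value
                    break


def determine_attributes(file_name, text, modified):
    """Return the (time, type, work) file item for one file."""
    attrs = {"time": None, "type": None, "work": None}
    # Step 101: "type" from the file extension.
    for ext, value in EXT_TYPE.items():
        if file_name.lower().endswith(ext):
            attrs["type"] = value
            break
    # Step 102: keywords in the file name.
    match_keywords(file_name, attrs)
    # Step 103: keywords in the first ten lines of the text.
    match_keywords("\n".join(text.splitlines()[:10]), attrs)
    # Step 104: fall back to the time stamp (Heisei year = year - 1988).
    if attrs["time"] is None:
        attrs["time"] = f"H{modified.year - 1988}"
    # Steps 105 and 106: default values.
    if attrs["type"] is None:
        attrs["type"] = "memo"
    if attrs["work"] is None:
        attrs["work"] = "other"
    return (attrs["time"], attrs["type"], attrs["work"])
```

Because each step writes only attributes that are still undetermined, an attribute value fixed by an earlier step is never overwritten, which matches the priority order described above.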
[0034]
The context is a combination of attribute values for classifying file items. The context plays a role in classifying file items according to a combination of attribute values, and plays a role in selecting a transaction group to be mined. By introducing the context, it becomes possible to properly grasp the file usage patterns that differ depending on the work situation. For example, in the present embodiment, a context is expressed by combining an attribute value of each file item with at least one empty set φ. For example, if the context specifies “time” = “H14”, it is expressed as in the following expression.
[0035]
(Equation 1)
$X_1 = \langle \{\mathrm{H14}\}, \phi, \phi \rangle$
[0036]
For example, if the context specifies “work” = “project 1”, it is expressed as in the following expression.
[0037]
(Equation 2)
$X_2 = \langle \phi, \phi, \{\text{Project 1}\} \rangle$
[0038]
Further, if the context specifies two attribute values at the same time, such as “time” = “H14” and “work” = “project 1”, the expression is expressed by the following expression.
[0039]
(Equation 3)
$X_3 = \langle \{\mathrm{H14}\}, \phi, \{\text{Project 1}\} \rangle$
[0040]
Comparing the context X₃, which specifies two attribute values at the same time, with the contexts X₁ and X₂, which each specify only one attribute value, the set of file items classified by X₁ or by X₂ always includes the set of file items classified by X₃. This inclusion relation establishes an upper and lower relation between contexts. In the above example, the contexts X₁ and X₂ are upper contexts of the context X₃. The higher a context is, the more widely it applies to various work situations, and the lower a context is, the more precisely it represents a specific work situation.
[0041]
A context can be formally defined as follows. Suppose there is a set of file items D = {d₁, ..., d_n}, and each file item d_i (i = 1 to n) is represented by the following equation, where a_i is the attribute value of "time", b_i is the attribute value of "type", and c_i is the attribute value of "work".
[0042]
(Equation 4)
$d_i = (a_i, b_i, c_i)$
[0043]
At this time, the minimum context X for the file item set D is defined by the following equation.
[0044]
(Equation 5)
$X = \langle \bigcap_i \{a_i\}, \bigcap_i \{b_i\}, \bigcap_i \{c_i\} \rangle$
[0045]
The attribute value specified by the context is a value that each file item belonging to D has in common, and φ indicates that there is no common attribute value. Further, in the present embodiment, “the file item d = (a, b, c) conforms to the context X = <A, B, C>” means that the following equation is satisfied.
[0046]
(Equation 6)
$X = \langle A \subseteq \{a\}, B \subseteq \{b\}, C \subseteq \{c\} \rangle$
[0047]
That is, each of A, B, and C is φ or equal to {a}, {b}, and {c}. In this embodiment, when each element of the file item set D conforms to the context X, the file item set D is said to conform to the context X.
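The minimum context of Equation 5 and the conformance condition of Equation 6 can be sketched in Python as follows. The names `minimal_context` and `conforms` are hypothetical, file items are modelled as (time, type, work) tuples, and φ is modelled as an empty frozenset; none of these choices is prescribed by the embodiment.

```python
EMPTY = frozenset()  # stands for the empty set φ


def minimal_context(items):
    """Equation 5: per attribute, intersect the singleton sets {a_i}.

    The intersection is {a} when every file item shares the value a,
    and φ otherwise.
    """
    context = []
    for position in range(3):
        values = {item[position] for item in items}
        context.append(frozenset(values) if len(values) == 1 else EMPTY)
    return tuple(context)


def conforms(item, context):
    """Equation 6: each context component is φ or equals {value}."""
    return all(
        component <= frozenset([value])
        for value, component in zip(item, context)
    )
```

For example, the file items (H14, plan, case 1) and (H14, memo, case 1) share "time" and "work" but not "type", so their minimum context is <{H14}, φ, {case 1}>, and both file items conform to it.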
[0048]
Between the contexts defined as described above, there is a hierarchy that arises from the inclusion relation between the sets of file items conforming to the contexts. In the present embodiment, the set of all file items conforming to the context X is denoted by I(X). I(X) is the set of all file items having the attribute values specified by the context X. The context X₁ being an upper context of the context X₂ means that the following equation is satisfied.
[0049]
(Equation 7)
$I(X_1) \supset I(X_2)$
[0050]
Likewise, the context X₁ being a lower context of the context X₂ means that the following equation is satisfied.
[0051]
(Equation 8)
$I(X_1) \subset I(X_2)$
[0052]
For example, the context generation means 4 of the present embodiment generates the following contexts: contexts that specify only an attribute value of "time" (as many as there are attribute values of "time"); contexts that specify only an attribute value of "type" (as many as there are attribute values of "type"); contexts that specify only an attribute value of "work" (as many as there are attribute values of "work"); contexts that specify the two attribute values of "time" and "type" (as many as there are combinations of attribute values of "time" and "type"); contexts that specify the two attribute values of "time" and "work" (as many as there are combinations of attribute values of "time" and "work"); contexts that specify the two attribute values of "type" and "work" (as many as there are combinations of attribute values of "type" and "work"); and the context in which all attribute values are the empty set. However, the context generation means 4 of the present embodiment omits, for example, any context to which no file item conforms. This is because, if no file item conforms to a context, the number of transactions to be mined by the rule extraction means 5 becomes zero and rule extraction becomes impossible. In addition, when the sets of file items conforming to a plurality of contexts are identical, the rules mined by the rule extraction means 5 are the same, so the context generation means 4 of the present embodiment keeps only one of those contexts.
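The context generation described in this paragraph can be sketched in Python as follows. The function name `generate_contexts` is hypothetical, φ is modelled as `None`, and keeping the first context found among those with identical conforming item sets is an assumption, since the text does not say which one is retained.

```python
from itertools import product


def generate_contexts(items):
    """Map each retained context to the frozenset of file items conforming to it.

    `items` is a collection of (time, type, work) tuples.
    """
    # Candidate values per attribute: every observed value, plus None for φ.
    choices = [sorted({item[p] for item in items}) + [None] for p in range(3)]
    by_item_set = {}
    for context in product(*choices):
        if all(value is not None for value in context):
            continue  # generated contexts specify zero, one, or two attribute values
        matching = frozenset(
            item for item in items
            if all(c is None or c == v for c, v in zip(context, item))
        )
        if not matching:
            continue  # no conforming file item: nothing to mine
        # Contexts with identical conforming sets yield identical rules: keep one.
        by_item_set.setdefault(matching, context)
    return {context: matching for matching, context in by_item_set.items()}
```

With the three file items (H14, plan, p1), (H14, memo, p1), and (H13, plan, other), the contexts <{H14}, φ, φ>, <φ, φ, {p1}>, and <{H14}, φ, {p1}> all select the same two file items, so only one of them is kept.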
[0053]
Here, the association rule in the present embodiment is a rule that expresses a pattern such as "when A occurred, B also occurred" as A → B, based on the co-occurrence frequency of A and B. Strictly speaking, the association rule is defined as follows. Given file items a_i and the set I of all file items, a transaction t_j ⊂ I is defined by the following equation.
[0054]
(Equation 9)
$t_j = \{a_{j1}, \ldots, a_{jn}\}$
[0055]
Let T be the set of all transactions. If a plurality of file items a₁, ..., a_n are contained together in the transactions t₁, ..., t_m ∈ T and m is sufficiently large, those file items a₁, ..., a_n can be said to tend to be used together. This tendency is expressed by the association rule R in the following equation.
[0056]
(Equation 10)
$R: a_1, \ldots, a_{n-1} \rightarrow a_n$
[0057]
R states that when a transaction contains a₁, ..., a_{n-1}, the same transaction also contains a_n. In other words, if a₁, ..., a_{n-1} are used, a_n is also used.
[0058]
Here, when the number of file items and the number of transactions are large, an enormous number of correlation rules can be extracted. This is because even a combination that occurs at most once can be a correlation rule. In order to use the association rules efficiently, it is necessary to remove unnecessary rules by using some scale to suppress the number of rules. Commonly used measures include support and confidence.
[0059]
The support is the probability that the correlation rule holds when one arbitrary transaction is selected. For a file item set A (⊂ I), the number of transactions that include all the file items belonging to A is defined as N(A), and the total number of transactions is denoted by N. For a file item b and the correlation rule R: A → b, the support sup(R) of R is defined by the following equation.
[0060]
(Equation 11)
$\mathrm{sup}(R) = N(A \cup \{b\}) / N$
[0061]
On the other hand, the confidence (certainty factor) is the probability that the rule R holds when the rule R is applicable. That is, it represents the conditional probability that b is used when all the file items in the file item set A are used. The confidence conf(R) of the rule R is defined by the following equation.
[0062]
(Equation 12)
$\mathrm{conf}(R) = N(A \cup \{b\}) / N(A)$
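The support and confidence defined in Equations 11 and 12 can be computed directly from a list of transactions. The following Python sketch uses hypothetical function names (`count`, `support`, `confidence`) and models each transaction as a set of file item identifiers; it assumes that at least one transaction contains the condition part A, since otherwise the confidence is undefined.

```python
def count(transactions, items):
    """N(A): number of transactions that contain every file item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, condition, conclusion):
    """Equation 11: sup(R) = N(A ∪ {b}) / N for the rule R: A → b."""
    return count(transactions, set(condition) | {conclusion}) / len(transactions)


def confidence(transactions, condition, conclusion):
    """Equation 12: conf(R) = N(A ∪ {b}) / N(A) for the rule R: A → b."""
    both = count(transactions, set(condition) | {conclusion})
    return both / count(transactions, condition)
```

For instance, with the four transactions {01, 02}, {01, 02, 03}, {01, 03}, and {04}, the rule 01 → 02 holds in two of the four transactions (support 0.5) and in two of the three transactions that contain 01 (confidence 2/3).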
[0063]
By evaluating association rules by support and confidence, highly useful rules can be selected. Apriori is a representative algorithm that extracts association rules at high speed when given the minimum support and minimum confidence that the rules must satisfy (reference: R. Agrawal and R. Srikant: Fast Algorithms for Mining Association Rules, In Proc. of the 20th International Conference on Very Large Data Bases, 1994). Apriori enumerates the file item sets that may form association rules as rule candidates, calculates the support of each candidate, and keeps only the candidates that satisfy the minimum support as rules. In doing so, it narrows down the candidates by using the property sup(A∪B) ≧ sup(A∪B∪C) of support. This property means that a rule of length k + 1 contains a rule of length k. Therefore, association rules can be extracted efficiently by generating candidates of length k + 1 from the rules of length k that satisfy the minimum support and checking each candidate against the minimum support. The minimum support and minimum confidence given to Apriori govern a trade-off between the risk of missing useful rules on the one hand and the number of rules and the computation time on the other.
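The level-wise candidate generation described above can be sketched in Python as follows. This is a simplified illustration of the Apriori idea, not the implementation used in the embodiment: the function name `frequent_itemsets` is hypothetical, support is recounted by scanning all transactions, and the derivation of rules from the frequent item sets by minimum confidence is omitted.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all item sets meeting the minimum support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: single file items that satisfy the minimum support.
    level = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in level if sup(s) >= min_support}
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = sup(s)
        # Join: build candidates of length k + 1 from frequent sets of length k.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: by sup(A∪B) >= sup(A∪B∪C), every k-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        level = {c for c in candidates if sup(c) >= min_support}
        k += 1
    return result
```

Raising `min_support` shrinks every level and therefore the running time, at the cost of discarding item sets that occur rarely, which is the trade-off noted in the text.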
[0064]
The correlation rule R: A → b under the context X should be a rule that predicts the file item required next when the file items used in the work approximated by the context X are given. Hereinafter, in the present embodiment, the association rule is also referred to as a prediction rule.
[0065]
Therefore, in the present embodiment, the set of all prediction rules satisfying {R: A → b | A ⊂ I(X)}, that is, the rules whose condition part A conforms to the context X, is defined as the prediction rule set R(X). Further, by making the condition stricter, the set of rules in which not only the condition part A but also the conclusion part b conforms to the context X is called the strict prediction rule set R_s(X). Since the rule set R(X) includes the strict rule set R_s(X), the predictable range is larger for R(X). For example, if the context X₁ = <φ, {plan}, φ> and the rule R₁: (H14, plan, other) → (H13, plan, other) exist, the rule R₁ is included in both the rule set R(X₁) and the strict rule set R_s(X₁). On the other hand, the rule R₂: (H14, plan, other) → (H14, memo, other) is included in the rule set R(X₁) but not in the strict rule set R_s(X₁). Below, to distinguish R(X) from R_s(X), R(X) is called the normal rule set.
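The difference between R(X) and R_s(X) can be checked mechanically. The sketch below assumes that a file item is a triple (time, type, work), that a context is a triple of attribute value sets, and that φ is encoded as an empty set meaning "no restriction"; these encodings and the function names are illustrative assumptions.

```python
from typing import FrozenSet, Tuple

FileItem = Tuple[str, str, str]  # (time, type, work)
Context = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]
Rule = Tuple[FrozenSet[FileItem], FileItem]  # (condition part A, conclusion part b)

PHI: FrozenSet[str] = frozenset()  # assumed encoding of φ: attribute is unrestricted


def conforms(item: FileItem, context: Context) -> bool:
    """True if every attribute value of the item is allowed by the context."""
    return all(not allowed or value in allowed for value, allowed in zip(item, context))


def in_rule_set(rule: Rule, context: Context) -> bool:
    """Membership in R(X): the condition part A conforms to the context."""
    condition, _ = rule
    return all(conforms(a, context) for a in condition)


def in_strict_rule_set(rule: Rule, context: Context) -> bool:
    """Membership in R_s(X): the conclusion part b must conform as well."""
    _, conclusion = rule
    return in_rule_set(rule, context) and conforms(conclusion, context)


# The example from the text: X1 = <φ, {plan}, φ>.
X1: Context = (PHI, frozenset({"plan"}), PHI)
R1: Rule = (frozenset({("H14", "plan", "other")}), ("H13", "plan", "other"))
R2: Rule = (frozenset({("H14", "plan", "other")}), ("H14", "memo", "other"))
print(in_rule_set(R1, X1), in_strict_rule_set(R1, X1))  # True True
print(in_rule_set(R2, X1), in_strict_rule_set(R2, X1))  # True False
```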
[0066]
When N(X) is extended to mean the number of transactions including any file item a ∈ I(X) that conforms to the context X, the support sup(R; X) of a rule R ∈ R(X) can be defined by the following equation, where N(R) denotes the number of transactions that satisfy the rule R, that is, N(A ∪ {b}).
[0067]
[Equation 13]
sup(R; X) = N(R) / N(X)
[0068]
sup (R; X) is a probability that the rule R is satisfied when an operation that may correspond to the context X is being performed. In the case of context X = <φ, φ, φ>, I (X) is a set of all file items, and N (X) matches the total number of transactions N, so that sup (R; X) = sup (R).
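A minimal sketch of the context-relative support follows. It assumes that N(R) is the number of transactions containing A ∪ {b} and that I(X) is given directly as a set of file item serial numbers; the function names and the toy data are hypothetical.

```python
from typing import FrozenSet, List


def n_context(transactions: List[FrozenSet[str]], context_items: FrozenSet[str]) -> int:
    """N(X): number of transactions containing at least one file item of I(X)."""
    return sum(1 for t in transactions if t & context_items)


def support_in_context(
    transactions: List[FrozenSet[str]],
    condition: FrozenSet[str],
    conclusion: str,
    context_items: FrozenSet[str],
) -> float:
    """sup(R; X) = N(R) / N(X) for a rule R: A → b with A ⊂ I(X)."""
    n_x = n_context(transactions, context_items)
    if n_x == 0:
        raise ValueError("no transaction conforms to the context")
    n_r = sum(1 for t in transactions if (condition | {conclusion}) <= t)
    return n_r / n_x


# Hypothetical history in which items 01 and 02 belong to rarely performed work.
history = [
    frozenset({"01", "02"}),
    frozenset({"01", "02"}),
    frozenset({"03"}),
    frozenset({"03", "04"}),
    frozenset({"05"}),
]
all_items = frozenset({"01", "02", "03", "04", "05"})
print(support_in_context(history, frozenset({"01"}), "02", all_items))                # 0.4
print(support_in_context(history, frozenset({"01"}), "02", frozenset({"01", "02"})))  # 1.0
```

With I(X) equal to the set of all file items, the value coincides with sup(R), as stated above; restricting the context raises the support of the same rule.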
[0069]
To enumerate the rule set R(X) of the context X given the minimum support and the minimum confidence, for example, the processing shown in FIG. 3 can be performed. In this case, first, transactions that do not include any file item conforming to the context X are removed from the mining targets (step 201). For example, when I(X) = {01, 02, 04}, the transaction t₁ = {03, 04, 06} is not removed, but the transaction t₂ = {03, 06, 07} is removed from the mining targets. Next, prediction rules are extracted using the Apriori algorithm (step 202). Next, rules whose condition part does not conform to the context X are removed from the extracted prediction rule set (step 203). For example, when I(X) = {01, 02, 04}, the rule R₁: 04 → 06 is not removed, but the rule R₂: 03 → 06 is removed.
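Steps 201 to 203 can be sketched as below. The miner `mine_rules` is a brute-force stand-in for Apriori written only for this illustration (it enumerates item sets directly and is suitable for tiny inputs); all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], str]  # (condition part A, conclusion part b)


def mine_rules(
    transactions: List[FrozenSet[str]],
    min_support: float,
    min_confidence: float,
    max_len: int = 3,
) -> List[Rule]:
    """Brute-force stand-in for Apriori: count item sets, then split them into rules."""
    n = len(transactions)
    counts: Dict[FrozenSet[str], int] = {}
    for t in transactions:
        for k in range(1, min(len(t), max_len) + 1):
            for subset in combinations(sorted(t), k):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    rules: List[Rule] = []
    for itemset, c in counts.items():
        if len(itemset) < 2 or c / n < min_support:
            continue
        for b in itemset:
            a = itemset - {b}
            if c / counts[a] >= min_confidence:
                rules.append((a, b))
    return rules


def extract_rule_set(
    transactions: List[FrozenSet[str]],
    context_items: FrozenSet[str],
    min_support: float,
    min_confidence: float,
) -> List[Rule]:
    # Step 201: drop transactions with no file item conforming to the context.
    kept = [t for t in transactions if t & context_items]
    # Step 202: mine prediction rules from the remaining transactions.
    rules = mine_rules(kept, min_support, min_confidence)
    # Step 203: drop rules whose condition part does not conform to the context.
    return [(a, b) for a, b in rules if a <= context_items]


# The example from the text: I(X) = {01, 02, 04}, t1 is kept and t2 is removed.
t1 = frozenset({"03", "04", "06"})
t2 = frozenset({"03", "06", "07"})
print(extract_rule_set([t1, t2], frozenset({"01", "02", "04"}), 0.5, 0.5))
```

Because the minimum support is evaluated after step 201, it is measured against the transactions that conform to the context, which corresponds to sup(R; X).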
[0070]
However, the processing for obtaining the rule set R(X) is not limited to the example shown in FIG. 3. For example, since the time required for mining is of exponential order in the transaction length and does not depend on the magnitude of the minimum confidence, the processing shown in FIG. 4 may be performed if the transactions are sufficiently long. In this case, file items that do not conform to the context X are removed from each transaction (step 301). When the transaction length becomes 0 (step 302; Yes), the transaction is removed (step 303). For example, when I(X) = {01, 02, 04}, the transaction t₁ = {03, 04, 06} becomes {04}, and the transaction t₂ = {03, 06, 07} is removed from the mining targets. Next, prediction rules are extracted using the Apriori algorithm with the minimum confidence set to 0 (step 304). The extracted prediction rule set is R_s(X). From each extracted rule R: A → b, the rule candidates represented by the following formula are generated, it is determined whether each candidate satisfies the predetermined minimum support and minimum confidence, and the candidates that satisfy them are added to the prediction rule set (step 305).
[0071]
[Equation 14]
Rule candidates {R_c: A ∪ {b} → c | c ∉ I(X)}
[0072]
The processing shown in FIG. 4 specializes the rule candidate generation step inside Apriori and adds it as post-processing. Since sup(R; X) ≧ sup(R_c; X), the same rule set as in the processing shown in FIG. 3 can be extracted.
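The transaction shortening of steps 301 to 303 and the candidate generation of step 305 can be sketched as follows. The sketch covers only these two steps, leaves the mining of step 304 and the support and confidence checks to a separate miner, and uses hypothetical function names.

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], str]  # (condition part A, conclusion part b)


def project_transactions(
    transactions: List[FrozenSet[str]], context_items: FrozenSet[str]
) -> List[FrozenSet[str]]:
    """Steps 301 to 303: keep only conforming file items, drop emptied transactions."""
    projected = []
    for t in transactions:
        reduced = t & context_items  # step 301
        if reduced:                  # steps 302 and 303: length 0 means removal
            projected.append(reduced)
    return projected


def candidate_rules(
    rule: Rule, all_items: FrozenSet[str], context_items: FrozenSet[str]
) -> List[Rule]:
    """Step 305: from R: A → b, generate R_c: A ∪ {b} → c for every c outside I(X)."""
    a, b = rule
    return [(a | {b}, c) for c in sorted(all_items - context_items)]


context = frozenset({"01", "02", "04"})
t1 = frozenset({"03", "04", "06"})
t2 = frozenset({"03", "06", "07"})
print(project_transactions([t1, t2], context))  # [frozenset({'04'})]
```

Each generated candidate still has to be checked against the minimum support and minimum confidence on the original (unshortened) transactions before it is added to the prediction rule set.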
[0073]
For example, the rule extracting means 5 of the present embodiment extracts the rule set R(X) based on the processing described above. The extracted rule set R(X) is stored in, for example, the rule database 9. As a result, effective rules based on the work histories of a plurality of users can be stored in the rule database 9, and the know-how of each user's work can be used as a shared resource. Further, for example, in the present embodiment, the GNU-licensed freeware program apriori.exe, which implements the Apriori algorithm, is used in the prediction rule extraction processing using the Apriori algorithm (steps 202 and 304). apriori.exe is available from http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html. Here, apriori.exe can be switched by an option between the normal support definition and its own support definition. For example, in this embodiment, the normal support definition is used. Incidentally, the strict rule set R_s(X) may also be extracted, for example, by the processing shown in FIG. 4.
[0074]
The work situation estimating means 6 selects one or a plurality of prediction file items and one or a plurality of prediction contexts from the group of file items and the group of contexts, based on the file usage status of the target user recorded by the file usage history management means 2. For example, in the present embodiment, the file usage status of the target user refers to the file usage history specific to the target user at present, or at present and in the near past (for example, a predetermined period of the past several hours or several days during which the same work is estimated to have been performed). The one or more prediction file items are selected based on, for example, the attribute values of the files appearing in the file usage status of the target user. The attribute values of a file are determined by, for example, the same processing as that described with reference to FIG. 2.
[0075]
Here, the problem is how to select the one or more prediction contexts, that is, which context's prediction rule set should be applied in predicting the files required by the user. If the prediction rule set is large, the predictable range increases and information that should be provided is less likely to be missed. However, information unnecessary for the user also increases, which hinders appropriate information presentation. On the other hand, if the prediction rule set is too small, the required information may not be provided to the user.
[0076]
Assuming that the set of file items used by the user at present or in the near past is A, it can be considered that the work approximated by a context to which A conforms has been performed. In general, there can be more than one context to which the set A of file items conforms. This is because if A conforms to a certain context, A also conforms to its upper contexts. However, as the size of the file item set I(X) for the context X increases, more information other than A is assumed, and information unnecessary for the user may increase.
[0077]
Therefore, the work situation estimating means 6 of the present embodiment selects the approximation that assumes as little information other than A as possible, that is, it selects the minimum context including A. The selection of the minimum context including A includes a case where a single context is selected and a case where a plurality of contexts are selected.
[0078]
When selecting a single context, from the context group {X_i} satisfying A ⊂ I(X_i), the context X_k that minimizes the size of the file item set, that is, the number of file items |I(X_k)| included in the file item set, is selected. This corresponds to the case where there is a single work corresponding to A.
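The single-context selection can be sketched as follows, assuming that each context is identified by a name and that its file item set I(X) has already been computed. The function name and the toy contexts are hypothetical; ties are resolved here simply by dictionary order, which the text does not specify.

```python
from typing import Dict, FrozenSet, Optional


def select_minimum_context(
    used_items: FrozenSet[str], context_item_sets: Dict[str, FrozenSet[str]]
) -> Optional[str]:
    """Return the context X_k with A ⊂ I(X_k) that minimizes |I(X_k)|, or None."""
    candidates = {
        name: items for name, items in context_item_sets.items() if used_items <= items
    }
    if not candidates:
        return None
    return min(candidates, key=lambda name: len(candidates[name]))


# Hypothetical contexts and their file item sets I(X).
contexts = {
    "<φ, φ, φ>": frozenset({"01", "02", "03", "04", "05"}),
    "<φ, {plan}, φ>": frozenset({"01", "02", "03"}),
    "<{H13}, {plan}, φ>": frozenset({"01", "02"}),
}
print(select_minimum_context(frozenset({"01", "03"}), contexts))  # <φ, {plan}, φ>
```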
[0079]
When selecting multiple contexts, from among the combinations of contexts {X_i} that include A, that is, A ⊂ ∪_i I(X_i), the combination that minimizes the size of the combined file item set, that is, the number of file items |∪_i I(X_i)| included in the combined file item set, is selected. This corresponds to a case where a plurality of tasks corresponding to the respective contexts are performed in parallel. In some cases, only one context is selected. Also, if there are a plurality of combinations of contexts for which |∪_i I(X_i)| is minimum, for example, the combination with the highest co-occurrence probability between the contexts is selected. Thereby, it is possible to select a combination of works that are often performed simultaneously.
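A brute-force sketch of the combination selection follows. It assumes that a combination qualifies when the union of its file item sets contains A, limits the search to a hypothetical `max_contexts` parameter, and does not implement the tie-break by co-occurrence probability (the first minimum found is kept).

```python
from itertools import combinations
from typing import Dict, FrozenSet, Optional, Tuple


def select_minimum_combination(
    used_items: FrozenSet[str],
    context_item_sets: Dict[str, FrozenSet[str]],
    max_contexts: int = 2,
) -> Optional[Tuple[str, ...]]:
    """Return the combination of contexts covering A with the smallest union size."""
    best: Optional[Tuple[str, ...]] = None
    best_size: Optional[int] = None
    names = sorted(context_item_sets)
    for r in range(1, max_contexts + 1):
        for combo in combinations(names, r):
            union = frozenset().union(*(context_item_sets[n] for n in combo))
            if used_items <= union and (best_size is None or len(union) < best_size):
                best, best_size = combo, len(union)
    return best


# Hypothetical contexts: two narrow ones for parallel work and one that covers everything.
contexts = {
    "all": frozenset({"01", "02", "03", "04", "05", "06", "07"}),
    "memo": frozenset({"04", "05"}),
    "plan": frozenset({"01", "02", "03"}),
}
print(select_minimum_combination(frozenset({"01", "04"}), contexts))  # ('memo', 'plan')
```

In this toy case no single narrow context contains both used items, so the pair of narrow contexts (five file items) is preferred over the all-covering context (seven file items).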
[0080]
In general, if each context closely approximates the work, selecting a plurality of contexts can approximate the work currently performed by the user with higher accuracy, and thus the prediction accuracy is considered to be improved. However, the processing of the work status estimating means 6 is not necessarily limited to the above example. For example, the user may select attribute values that are likely to be related, and the context may be selected based on those attribute values.
[0081]
When the prediction file items and the prediction contexts have been selected by the work status estimating means 6, the information providing means 7 searches the rule database 9 for the correlation rules corresponding to the prediction contexts and, by applying these rules, specifies the predicted file items correlated with the prediction file items. Then, the information providing means 7 allows the user to easily access the files belonging to the specified predicted file items. The method of making the files accessible is not particularly limited. For example, the files may be transmitted to the user terminal 8 owned by the user, or an HTML file having hyperlinks to the files may be transmitted to the user terminal 8 owned by the user. Here, the information providing means 7 needs to grasp the location of the files. For this purpose, for example, the file itemization means 3 may manage the location of each file in association with the corresponding file item. In addition, a well-known file search technology for an intranet, a group of shared files, a document database, or the like may be used.
[0082]
Here, since there can be a plurality of association rules corresponding to a context, a plurality of predicted file items can be specified. If the number of files predicted to be required by the user becomes enormous, the files to be presented may be narrowed down using the user's affiliation or interests. In addition, the predicted file items may be ranked in descending order of the probability (support or confidence) that the applied correlation rule holds, and only a predetermined number of predicted file items may be selected from the top. Furthermore, the file group may be presented in an organized, easy-to-understand manner. For example, file items may be arranged hierarchically using attribute values, files similar to the currently used file group may be grouped together, or ranking may be performed using prediction accuracy or the like. By organizing the information in this way, the user can easily select the necessary information. In addition, a method of broadly classifying the information and narrowing it down according to the user's selection, a method of presenting representative information and then gradually expanding the range of information provided, and the like may be adopted as appropriate.
[0083]
[Example]
An experiment was performed to verify the effectiveness of the above-described required file prediction method and system. In the experiment, the file usage history management means 2 recorded the file accesses of one user over 49 days. The attribute values given in advance for file itemization were determined as follows in consideration of the work that occurred during the above period. That is, the attribute values of “time” are {H12, H13}, the attribute values of “type” are {memo, plan, budget, hearing, program, data}, and the attribute values of “work” are {theme 1, EMS, other}. The main tasks that the user performed during the experiment were updating the homepage of the environmental management system (EMS), creating a program, creating materials for research interviews, and “Theme 1” (research theme: “Elements of next-generation intelligent information systems”). In addition, a breakdown of the budget for the next fiscal year and a research memo were prepared. During the experiment, memos and programs for “Theme 1” were often created. In this experiment, all the research themes from before the previous year were classified as “other”.
[0084]
The number of files used by the user during the experiment period was 36, and the number of file items after conversion into file items was 17. Table 1 shows the file items that appeared in the file usage history. In Table 1, each row indicates a file item, the first column indicates the serial number identifying the file item, the second column indicates the attribute values of the file item, and the third column indicates the number of files integrated into the file item.
[0085]
[Table 1]

[0086]
Table 2 shows a part of the file use history that has been converted into file items by the file itemization means 3. Each row in Table 2 represents one day's work. The second column shows the file item sequence in which the file items are arranged in the order of use, and the third column shows the file item set obtained by removing duplicate file items and arranging the rest in order, for use in extracting prediction rules. At the time of mining by the rule extracting means 5, the file item set in the third column is used.
[0087]
[Table 2]

[0088]
Although the total number of contexts based on combinations of attribute values is 54, the number of contexts actually generated is 23. This is because contexts to which no file item conforms, and contexts whose conforming file items coincide with those of another context, are omitted. For example, the context <{H12}, {HTML}, φ> is omitted because, in this experiment, there is no file item having the attribute values “H12” and “HTML” at the same time. Further, for example, in this experiment, the file items conforming to the context <{H13}, {budget}, φ> coincide with the file items conforming to the context <φ, {budget}, φ> ({14, 16}), so the former context is omitted.
[0089]
Treating each day's usage record as one transaction, the rule extracting means 5 extracted the prediction rule sets under the conditions of a minimum support of 10% and a minimum confidence of 50%. In this experiment, both the normal prediction rule set R(X) and the strict prediction rule set R_s(X) were extracted. Table 3 shows the results.
[0090]
[Table 3]

[0091]
Each row in Table 3 corresponds to a context X, and each column has the following meaning. The first column represents the set I(X) of file items conforming to the context X. The second column represents the number N(X) of transactions that include a file item conforming to the context X. The third column indicates the number of rules, namely the size |R_s(X)| of the strict prediction rule set and the size |R(X)| of the normal prediction rule set. The fourth column indicates the number of new rules, that is, rules that cannot be extracted in the upper context and are first extracted by mining in that context. For example, the new rules in X = <φ, {plan}, φ> are the rules mined in X that are not mined in <φ, φ, φ>. In the notation a/b/c, a is the number of new rules relative to <φ, φ, φ>, b is the number of new rules relative to the upper context having the first non-φ attribute, and c is the number of new rules relative to the upper context having the second non-φ attribute. For example, for the strict rule set of the context <{H12}, {plan}, φ>, the notation indicates that two new rules were extracted relative to <φ, φ, φ>, one relative to <{H12}, φ, φ>, and zero relative to <φ, {plan}, φ>. Table 4 shows examples of the extracted rules. In Table 4, for the sake of clarity, the attribute values common to all file items in a rule are grouped in the second column.
[0092]
[Table 4]

[0093]
The rule set of context X₀ = <φ, φ, φ> is the rule set obtained when Apriori is applied without introducing a context. Therefore, the rule set of context X₀ is compared with the rule set of each context. As can be seen from Table 3, the number of rules in each context is reduced, which shows that only rules specialized for the context are extracted. In addition, since new rules are extracted, it can be understood that introducing a context allows prediction rules for rarely occurring work to be extracted in each context. When context X₀ is compared with the individual contexts, in most cases the predictable range of the rule set is larger for context X₀. However, when the predictable range of the entire prediction rule set with contexts introduced is compared, the case where contexts are introduced has the larger range. Table 5 shows this.
[0094]
[Table 5]

[0095]
Each row in Table 5 shows one of the following cases: a rule set without a context is used (no context), a context is introduced and the entire normal prediction rule set is used (context (normal)), or a context is introduced and the entire strict prediction rule set is used (context (strict)). The predictable file items in the second column are the file items that can be predicted by any of the prediction rules. The number of unpredictable file items in the third column is the number of file items that cannot be predicted in that case. Here, the total number of file items is 17. As can be seen from Table 5, in the case of “no context”, the number of unpredictable file items is as large as 10. From this, it can be seen that, when no context is introduced, prediction for rare work fails or is unlikely to succeed. On the other hand, when a context is introduced, the number of unpredictable file items is as small as four or three, so that appropriate prediction may be possible even for rare work. Therefore, it can be confirmed that introducing a context makes appropriate prediction possible even for rare work.
[0096]
Next, the effects on the prediction performance are evaluated when a single context is selected, when multiple contexts are selected, and when a normal prediction rule set and a strict prediction rule set are used.
[0097]
The file item sequence A used for prediction, corresponding to the file group used by the user, is extracted from the file item sequence corresponding to a transaction (second column in Table 2). For example, the fifth to seventh file items “05 03 04” are extracted from the file item sequence “01 02 03 04 05 03 04 04” in the first row of Table 2. If the A to be extracted were simply chosen by uniform random numbers, there would be a high possibility that only A consisting of frequently used file items would be extracted, which would make it difficult to evaluate the prediction performance for rare work. Therefore, for each of the 17 file items, one A that always includes that file item is randomly extracted. For example, for the file item 12, one of the file item sequences including “12” (“08 12 01”, “01 12 01”, etc.) is randomly extracted. The length of the extracted A was set to 2 or 3, determined by random numbers. However, when elements are duplicated (for example, “01 12 01”), the later duplicate is omitted from A (for example, “01 12”).
[0098]
The file items to be predicted were the file item set obtained by removing the file item sequence A used for prediction from the original file item sequence. For example, for the sequence “01 02 03 04 03 01”, the file items to be predicted when A = {03, 04} is given are {01, 02, 03}. Note that duplicate file items are omitted.
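The construction of one evaluation case can be sketched as follows. The sketch assumes that A is a contiguous slice of the day's file item sequence and that only that slice is removed, which is consistent with the example above where 03 remains among the file items to be predicted; the function name is hypothetical.

```python
from typing import List, Set, Tuple


def split_for_evaluation(
    sequence: List[str], start: int, length: int
) -> Tuple[List[str], Set[str]]:
    """Return (A, file items to be predicted) for the slice sequence[start:start+length]."""
    # A: the extracted slice with later duplicates omitted, order preserved.
    given = list(dict.fromkeys(sequence[start:start + length]))
    # Targets: everything outside the slice, with duplicates omitted.
    rest = sequence[:start] + sequence[start + length:]
    return given, set(rest)


given, targets = split_for_evaluation(["01", "02", "03", "04", "03", "01"], 2, 2)
print(given)            # ['03', '04']
print(sorted(targets))  # ['01', '02', '03']
```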
[0099]
Then, predictions were made for the 17 file item sequences A using the following five rule sets: (1) the rule set obtained by simply applying Apriori without using a context (no context), (2) the strict rule set when a single minimum context including A is selected (1 context (strict)), (3) the normal rule set when a single minimum context including A is selected (1 context (normal)), (4) the strict rule set when the minimum combination of two contexts including A is selected (2 contexts (strict)), and (5) the normal rule set when the minimum combination of two contexts including A is selected (2 contexts (normal)). For prediction, rule sets with a minimum support of 10% and a minimum confidence of 50% were used.
[0100]
For the evaluation of the prediction results, the recall rate and the accuracy, which are used for evaluating search results in information retrieval, and the F value, which is a comprehensive evaluation criterion, were used. The recall rate evaluates how few predictions are missed, and the accuracy evaluates how few unnecessary file items are contained in the prediction results. Letting N be the number of file items actually predicted among the file items to be predicted, the recall rate, accuracy, and F value are defined by the following equations.
[0101]
[Equation 15]
Recall = N / number of file items to be predicted
[0102]
[Equation 16]
Accuracy = N / number of predicted file items
[0103]
[Equation 17]
F value = 2 · Recall · Accuracy / (Recall + Accuracy)
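The three measures can be computed as in the sketch below. It assumes the usual information-retrieval reading: recall divides N by the number of file items to be predicted, accuracy divides N by the number of file items actually predicted, and the F value is their harmonic mean. The function name and the zero-division conventions are assumptions.

```python
from typing import Set, Tuple


def evaluate(predicted: Set[str], targets: Set[str]) -> Tuple[float, float, float]:
    """Return (recall, accuracy, F value) for one prediction."""
    hits = len(predicted & targets)  # N: targets that were actually predicted
    recall = hits / len(targets) if targets else 0.0
    accuracy = hits / len(predicted) if predicted else 0.0
    # Harmonic mean; defined as 0 when nothing was predicted correctly.
    f_value = 2 * recall * accuracy / (recall + accuracy) if hits else 0.0
    return recall, accuracy, f_value


# Hypothetical case: four file items predicted, three were needed, two match.
recall, accuracy, f_value = evaluate({"01", "02", "05", "06"}, {"01", "02", "03"})
print(round(recall, 3), round(accuracy, 3), round(f_value, 3))  # 0.667 0.5 0.571
```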
[0104]
Table 6 shows the evaluation of the prediction using each rule set.
[0105]
[Table 6]

[0106]
At this time, 〇, △, and × indicate the best, next best, and worst values, respectively. As can be seen from Table 6, there is a trade-off between recall and accuracy. In addition, it can be seen that the strict rule set is more accurate than the normal rule set and is inferior in recall. Comparing with the F value which is a comprehensive evaluation, better results are obtained with 2 contexts (normal) and 1 context (strict) than without context.
[0107]
The prediction result of "2 contexts (normal)" having a high recall includes most of the file items required for the work. Therefore, in the present embodiment, an attempt is made to improve the prediction accuracy of the information to be provided by ranking the predicted file items according to the certainty factor of the rules.
[0108]
The certainty factor of a file item is taken to be the largest of the certainty factors of the rules that predicted that file item. For example, suppose that there are rules R₁ and R₂ that predict the file item b and have certainty factors of 50% and 80%, respectively. In this case, if both R₁ and R₂ are used, the certainty factor of b is 80%. Table 7 shows the predicted file items arranged in order of certainty factor when 2 contexts (normal) is used.
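The ranking step can be sketched as follows. The input is assumed to be a list of (predicted file item, certainty factor of the rule that predicted it) pairs; the function name, the optional top-n cutoff parameter, and the tie-break by serial number are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple


def rank_predictions(
    fired_rules: List[Tuple[str, float]], n: Optional[int] = None
) -> List[Tuple[str, float]]:
    """Rank predicted file items by the largest certainty factor among their rules."""
    best: Dict[str, float] = {}
    for item, certainty in fired_rules:
        best[item] = max(certainty, best.get(item, 0.0))
    # Descending certainty factor; ties broken by file item serial number.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if n is None else ranked[:n]


# R1 and R2 both predict b, with certainty factors of 50% and 80%.
print(rank_predictions([("b", 0.5), ("b", 0.8), ("c", 0.6)]))       # [('b', 0.8), ('c', 0.6)]
print(rank_predictions([("b", 0.5), ("b", 0.8), ("c", 0.6)], n=1))  # [('b', 0.8)]
```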
[0109]
[Table 7]

[0110]
Each row of Table 7 shows the file items to be predicted and the predicted file items for one of the 17 given file item sequences. The first column in Table 7 shows the file item sequence given as the file items being used in the work performed by the user. The second column shows the file items to be predicted from the given file item sequence. The set of file items that appear in the first and second columns is the original transaction. The third column shows the file items actually predicted and their certainty factors (in X/YY.Y, X represents the file item and YY.Y represents the certainty factor of the file item X). The predicted file items are listed in descending order of certainty factor; those in bold are file items that should have been predicted, and those in italics are file items that should not be predicted but are included in the file item sequence given for prediction.
[0111]
As can be seen from Table 7, among the predicted file items, those with a high certainty factor tend to be the file items that should be predicted. Therefore, when the accuracy for the top n items selected from the predicted file items was calculated, the accuracy was 0.72 and 0.63 for n = 2 and 3, respectively, both higher than the accuracy of 0.52 obtained when the certainty factor is not considered. These values were also higher than the accuracy of 0.54 obtained when no context was used.
[0112]
One of the causes of the deterioration of the prediction accuracy is that file items appearing in the file item sequence given for prediction (for example, “07 08 10” in the first column of Table 7) appear among the predicted file items (italic characters in Table 7; for example, 07/100.0 and 08/100.0 in the third column of the eighth row). This is because another file item in the given file item sequence is predicted from a part of the given file item sequence. For example, the result in the eighth row of Table 7 occurs because 07 and 08 are frequently used together, so the rules 07 → 08 and 08 → 07 are applied. These file items represent file groups of the same type as the files currently in use. If these files are classified as files similar to the files in use and are made available to the user as needed, they are often useful rather than a hindrance to information provision. Therefore, the essentially unnecessary file items that hinder the provision of appropriate information to the user are considered to be the file items that are neither bold nor italic among the predicted file items in Table 7.
[0113]
FIG. 5 shows the number of unnecessary file items included in the predicted file items and the number of file items that should have been included in the prediction result but were not (missing file items) when the top n items with the highest confidence are selected from the predicted file items. The following can be seen from FIG. 5. First, the prediction accuracy is improved by reducing n, but the number of misses increases. For example, when n = 2, only 0.18 unnecessary file items are included on average, but the number of missing file items is 1.18 on average. Second, the number of unnecessary file items and the number of missing file items are balanced between n = 3 and n = 4. From the above results, when providing information to the user, the quality and amount of information can be controlled according to the work situation by gradually expanding or reducing the provision range starting from n = 3 or 4, which is thought to make effective information provision possible.
[0114]
As described above, the results of the present embodiment confirmed that (a) the rule extraction method according to the present invention increases the number of predictable file items as compared with a normal data mining method without a context, (b) the prediction performance is improved by the context selection method according to the present invention, (c) the prediction accuracy is improved by selecting the n file items with the highest confidence among the prediction results, and (d) the trade-off between the quality and the quantity of the information provided can be controlled by increasing or decreasing n. From the above, it was confirmed that the present invention is extremely effective in predicting and providing the information required for work.
[0115]
As described above, according to the present invention, the use of a plurality of files is replaced by the use of a single file item, so that the apparent use frequency of each file item in the file use history can be increased, and abstracted, general rules can be extracted. Further, since file items obtained by integrating a group of files having common properties are targeted for data mining, the time required for mining can be reduced as compared with the case where individual files are targeted for data mining. Furthermore, since a newly created file can be reduced to an existing file item, it is possible to prevent a situation in which a new file is excluded from the prediction targets. In addition, by categorizing the files required for the work by the corresponding file items and providing the information to the user, the user can easily select the information.
[0116]
In addition, the type of file used in an arbitrary operation can be approximated by an appropriately selected context or a combination of contexts. In that sense, it can be said that the context reflects the work situation of the user. The context selection estimates the range of file items needed for the work the user is currently performing. By limiting the prediction range, it is possible to improve the performance of the prediction of the next required file item.
[0117]
Further, by introducing the context, the transaction group is divided based on the attribute values of the file items. This prevents harmful correlation rules that deteriorate the prediction accuracy from being extracted due to the parallelism of the work. For example, suppose that files A, B, C, and D are used when meeting material creation work P₁ is performed, and files X and Y are used when new project planning work P₂ is performed. At this time, it is assumed that the files are used in the time sequence A → B → X → Y → C → D. By introducing the context, for example, X and Y having the attribute value “plan” and A, B, C, and D having the attribute value “meeting material” are regarded as being used in separate works. If mining is performed with the scope limited to the transactions including the file items corresponding to each file, the prediction rules for work P₁ and work P₂ can be extracted individually, and the possibility of satisfying the minimum support is increased.
[0118]
Furthermore, in work that rarely occurs, the frequency of occurrence of the file usage pattern specific to that work is low. Therefore, when a normal mining method is used, the prediction rules corresponding to the rare work are screened out. For example, if there is a report creation task performed quarterly (about every 90 days), a rule specific to that task can obtain a support of only about 1/90, which is too low, so the rule may be ignored. By extracting the prediction rules for each context, however, the frequency relative to the file item set specified by the context increases. That is, a rule whose support is too low when all file items are considered, and which would therefore be ignored, can be extracted as a sufficiently effective rule when the scope is limited to the context. Therefore, it is possible to extract rules specific to work that rarely occurs. Further, in that context, the trouble of extracting unnecessary rules can be saved.
[0119]
The above embodiment is an example of a preferred embodiment of the present invention, but the present invention is not limited to this, and various modifications can be made without departing from the gist of the present invention. For example, when selecting a prediction file item or a prediction context based on the target user's file usage status, or when providing predicted file information to a user, each user's profile (e.g., the department to which the user belongs, etc.) may be used, and information estimated to be unnecessary may be omitted by referring to the file usage patterns or prediction rules of other users having similar profiles. The method of the present invention is not necessarily limited to implementation by a system (device). In that case, the file is not necessarily limited to electronic data and may be paper information in some cases.
[0120]
[Effect of the Invention]
As is apparent from the above description, according to the method for predicting a required file according to claim 1, the system for predicting a required file according to claim 7, and the program for predicting a required file according to claim 8, files created in the past by the user or by other users can be used effectively, and the know-how of the work can be shared as a resource.
[0121]
Further, according to the present invention, since the use of a plurality of files is replaced with the use of a single file item, the apparent use frequency of each file item in the file use history can be increased, and abstracted, general rules can be extracted. In addition, by setting file items as the data mining targets, the time required for mining can be reduced as compared with the case where individual files are targeted for data mining. Furthermore, since a newly created file can be reduced to an existing file item, it is possible to prevent a situation in which the new file is excluded from the prediction targets. Further, the files necessary for the work can be categorized by the corresponding file items and provided to the user, so that the user can easily select information.
[0122]
Furthermore, introducing contexts makes it possible to extract prediction rules for each of the tasks that a user carries out in parallel. In addition, rules whose support is too low when all file items are considered together, and which would therefore be ignored, can be extracted as sufficiently valid rules once the scope is limited to a context. Rules specific to rarely occurring work can therefore be extracted.
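The effect of limiting rule extraction to a context can be sketched as follows. This is a simplified illustration under stated assumptions, not the rule extraction procedure of the embodiment: file items and contexts are modelled as sets of (attribute, value) pairs, only pairwise rules are mined, and support is assumed to be measured over the sessions that contain at least one file item belonging to the context. All names and data are hypothetical.

```python
from itertools import permutations

def item(**attrs):
    """Build a file item or a context as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def extract_rules(sessions, context, min_support, min_confidence):
    """Mine pairwise rules A -> B among file items that belong to `context`.

    Assumption: support and confidence are computed only over sessions
    that contain at least one file item belonging to the context.
    """
    scoped = [frozenset(i for i in s if context <= i) for s in sessions]
    scoped = [s for s in scoped if s]
    rules = []
    if not scoped:
        return rules
    items = set().union(*scoped)
    for a, b in permutations(items, 2):
        n_a = sum(1 for s in scoped if a in s)
        n_ab = sum(1 for s in scoped if a in s and b in s)
        support = n_ab / len(scoped)
        confidence = n_ab / n_a
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules

# Illustrative history: eight routine sessions for project A, two rare ones for project B.
report, data = item(project="A", kind="report"), item(project="A", kind="data")
spec, review = item(project="B", kind="spec"), item(project="B", kind="review")
sessions = [frozenset({report, data})] * 8 + [frozenset({spec, review})] * 2

# Empty context: every file item belongs to it, so this is mining over the whole history.
global_rules = extract_rules(sessions, item(), 0.5, 0.8)
# Context limited to project B.
rare_rules = extract_rules(sessions, item(project="B"), 0.5, 0.8)
```

Over the whole history the rule between `spec` and `review` has a support of 0.2 and is discarded at a threshold of 0.5, whereas within the project B context the same rule has a support of 1.0 and is kept.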
[0123]
Furthermore, according to the required file prediction methods of claims 2 and 3, it is possible to extract rules that, given a file item used in work approximated by a context, predict the file item that will be needed next.
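The difference between the two variants (only the condition part belongs to the context, or both the condition part and the conclusion part belong to it) can be expressed as a simple membership check. The following Python sketch is an illustration only; the representation of file items and contexts as sets of (attribute, value) pairs and the function names are assumptions, not part of the embodiment.

```python
def item(**attrs):
    """Build a file item or a context as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def belongs(file_item, context):
    """A file item belongs to a context if it has every attribute value the context fixes."""
    return context <= file_item

def rule_fits_context(condition, conclusion, context, require_conclusion=False):
    """Check whether a rule (condition -> conclusion) may be attached to a context.

    require_conclusion=False: only the condition part must belong to the context.
    require_conclusion=True:  the condition part and the conclusion part must both belong.
    """
    if not all(belongs(i, context) for i in condition):
        return False
    if require_conclusion:
        return all(belongs(i, context) for i in conclusion)
    return True

# Illustrative rule: a project A plan leads to a project B budget.
context = item(project="A")
condition = [item(project="A", kind="plan")]
conclusion = [item(project="B", kind="budget")]

condition_only = rule_fits_context(condition, conclusion, context)
both_parts = rule_fits_context(condition, conclusion, context, require_conclusion=True)
```

In this example the rule is accepted when only the condition part is checked and rejected when the conclusion part must also belong to the context.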
[0124]
Further, according to the required file prediction methods of claims 4 and 5, assumptions about anything other than the files recently used by the target user are avoided as far as possible, and the provision of information that the target user does not need can be reduced.
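The selection of a prediction context can be sketched as choosing, among the contexts to which a prediction file item belongs, the one to which the fewest other file items belong. The following Python sketch covers only the single-context case and is an illustration under assumptions: the set-based representation, the function name `select_context`, the candidate contexts, and the tie-breaking behaviour (the first candidate wins) are all hypothetical.

```python
def item(**attrs):
    """Build a file item or a context as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def select_context(prediction_item, contexts, all_items):
    """Pick the context that contains `prediction_item` and the fewest other file items.

    Returns None if the prediction file item belongs to no candidate context.
    Ties are resolved in favour of the earliest candidate (an assumption).
    """
    candidates = [c for c in contexts if c <= prediction_item]
    if not candidates:
        return None

    def other_members(context):
        return sum(1 for i in all_items if i != prediction_item and context <= i)

    return min(candidates, key=other_members)

# Illustrative file items and candidate contexts.
a_plan = item(project="A", kind="plan")
all_items = [a_plan, item(project="A", kind="budget"),
             item(project="B", kind="plan"), item(project="C", kind="plan")]
contexts = [item(), item(project="A"), item(kind="plan")]

chosen = select_context(a_plan, contexts, all_items)
```

Here the empty context contains three other file items, the `kind="plan"` context contains two, and the `project="A"` context contains one, so the `project="A"` context is chosen. Choosing the narrowest context keeps assumptions about files other than those the user recently used to a minimum.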
[0125]
Furthermore, according to the required file prediction method of claim 6, the provision of information that the target user does not need can be reduced, and by adjusting the range of the top n predicted file items to be selected, the quality and quantity of the information provided can be controlled to suit the target user's work situation.
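The ranking and top n selection described above can be sketched in Python as follows. This is an illustration only: the rule representation as (condition item, conclusion item, confidence) triples, the function name, and the sample values are assumptions, and when several rules predict the same file item the sketch keeps the highest confidence, which is a design choice not stated in this document.

```python
def top_n_predictions(rules, prediction_items, n):
    """Rank predicted file items by rule confidence and keep the top n.

    rules: iterable of (condition_item, conclusion_item, confidence).
    prediction_items: file items taken from the target user's recent usage.
    """
    best = {}
    for condition, conclusion, confidence in rules:
        if condition in prediction_items and conclusion not in prediction_items:
            # Assumption: keep the highest confidence among rules predicting the same item.
            best[conclusion] = max(confidence, best.get(conclusion, 0.0))
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Illustrative rules with made-up confidence values.
rules = [
    ("plan", "budget", 0.9),
    ("plan", "minutes", 0.6),
    ("plan", "memo", 0.3),
    ("spec", "budget", 0.4),
]

narrow = top_n_predictions(rules, {"plan"}, 2)
wide = top_n_predictions(rules, {"plan"}, 3)
```

A small n returns only the most probable file items, so less unnecessary information is shown but more needed files may be missed; a larger n has the opposite effect.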
[Brief description of the drawings]
FIG. 1 is a block diagram showing the schematic configuration of an embodiment of the necessary file prediction system according to the present invention.
FIG. 2 is a schematic flowchart illustrating an example of a process of determining an attribute value of a file appearing in a file use history.
FIG. 3 is a schematic flowchart illustrating an example of a process of extracting a correlation rule by a rule extracting unit.
FIG. 4 is a schematic flowchart illustrating another example of a process of extracting a correlation rule by a rule extracting unit.
FIG. 5 is a diagram illustrating the trade-off between prediction accuracy and the number of missed files when predicted file items are ranked in descending order of the probability (confidence) of the applied correlation rule and the top n items are selected.
[Explanation of symbols]
1 Necessary file prediction system
2 File usage history management means
3 File itemization means
4 Context generation means
5 Rule extraction means
6 Work status estimation means
7 Information provision means

Claims

1. A required file prediction method, wherein: a plurality of attributes representing characteristics of files, and the attribute values that each of the attributes can take, are defined in advance; for the files appearing in a file usage history recorded in the past, each group of files having the same attribute values is collected into one file item, whereby the file usage history is converted into file items; a plurality of contexts, each being a combination of the attribute values, are generated; correlation rules holding between the file items are extracted, in association with each of the contexts, based on the file usage history converted into file items; one or more prediction file items and one or more prediction contexts are selected from the group of file items and the group of contexts based on the file usage status of a target user; and the correlation rule corresponding to the prediction context is applied to identify a predicted file item correlated with the prediction file item, and the files belonging to the predicted file item are predicted as files required by the target user.

2. The required file prediction method according to claim 1, wherein the correlation rules are extracted such that the file items in the condition part belong to the corresponding context.

3. The required file prediction method according to claim 1, wherein the correlation rules are extracted such that both the file items in the condition part and the file items in the conclusion part belong to the corresponding context.

4. The required file prediction method according to any one of claims 1 to 3, wherein one or more of the prediction file items are identified based on the attribute values of the files appearing in the file usage status, and the one context to which the prediction file item belongs and to which the fewest other file items belong is selected.

5. The required file prediction method according to claim 1, wherein one or more of the prediction file items are identified based on the attribute values of the files appearing in the file usage status, and a combination of the contexts to which the prediction file items belong and to which the fewest other file items belong is selected.

6. The predicted file items are ranked in descending order of the probability that the applied correlation rule holds, and a predetermined number of the predicted file items are selected from the top of the ranking. The required file prediction method described in

7. A required file prediction system comprising: file usage history management means for recording a file usage history; file itemization means for collecting, based on a plurality of attributes representing characteristics of files and the attribute values that each of the attributes can take, each group of files having the same attribute values among the files appearing in the file usage history into one file item, thereby converting the file usage history into file items; context generation means for generating a plurality of contexts, each being a combination of the attribute values; rule extraction means for extracting, based on the file usage history converted into file items and in association with each of the contexts, correlation rules holding between the file items; work status estimation means for selecting one or more prediction file items and one or more prediction contexts from the group of file items and the group of contexts, based on the file usage status of a target user recorded by the file usage history management means; and information providing means for applying the correlation rule corresponding to the prediction context to identify a predicted file item correlated with the prediction file item, and for enabling the target user to access the files belonging to the predicted file item.

8. A required file prediction program that causes a computer to function as: file usage history management means for recording a file usage history; file itemization means for collecting, based on a plurality of attributes representing characteristics of files and the attribute values that each of the attributes can take, each group of files having the same attribute values among the files appearing in the file usage history into one file item, thereby converting the file usage history into file items; context generation means for generating a plurality of contexts, each being a combination of the attribute values; rule extraction means for extracting, based on the file usage history converted into file items and in association with each of the contexts, correlation rules holding between the file items; work status estimation means for selecting one or more prediction file items and one or more prediction contexts from the group of file items and the group of contexts, based on the file usage status of a target user recorded by the file usage history management means; and information providing means for applying the correlation rule corresponding to the prediction context to identify a predicted file item correlated with the prediction file item, and for enabling the target user to access the files belonging to the predicted file item.