JP7434867B2

JP7434867B2 - Methods, devices and storage media for extracting information from web pages

Info

Publication number: JP7434867B2
Application number: JP2019223095A
Authority: JP
Inventors: ジョン・ジョォングアン; 遥孟; 俊孫
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-12-18
Filing date: 2019-12-10
Publication date: 2024-02-21
Anticipated expiration: 2039-12-10
Also published as: JP2020098596A; CN111339457B; CN111339457A

Description

The present disclosure relates to natural language processing, and specifically to information extraction based on multiple web pages.

Collecting and extracting information from the Internet is an important means of building a knowledge base. For example, product information can be extracted from an e-commerce company's web pages to build a product knowledge base. Conventional methods fall mainly into two types.

The first type of method applies to pages having a similar structure (for example, the product list pages of an e-commerce company's website, where each page has a similar structure). A template is created manually, or structural templates of the product information contained in the web pages are learned by an unsupervised or supervised method, and the learned structural templates are then used to parse other similar web pages. As shown in FIG. 1A, product information on books and shoes may be extracted by learning the structure information of the mobile phone page.

The second type of method applies to pages having a single (dissimilar) structure. As shown in FIG. 1B, the structure of the web page is analyzed dynamically, the position of the related information in the web page is located using a list of keywords, and the values are extracted.

The following presents a brief summary of the invention in order to provide a basic understanding of aspects of the invention. This summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

The present invention provides a method, an apparatus, and a storage medium for extracting information from web pages.

In one aspect of the invention, a method for extracting information from a web page is provided, the method including: generating a tree structure for each page, among the web page and all of its extended web pages, that contains the domain name of the web page; determining a navigation bar node in the tree structure; determining a leaf node that is covered by the navigation bar node and matches one or more keywords; and extracting information from the page corresponding to the matching leaf node.

In another aspect of the invention, an apparatus for extracting information from a web page is provided, the apparatus including: tree structure generation means for generating a tree structure for each page, among the web page and all of its extended web pages, that contains the domain name of the web page; navigation bar node determination means for determining a navigation bar node in the tree structure; match node determination means for determining a leaf node that is covered by the navigation bar node and matches one or more keywords; and information extraction means for extracting information from the page corresponding to the matching leaf node.

Other aspects of the invention further provide corresponding computer program code, a computer-readable storage medium, and a computer program product.

According to the method and apparatus for extracting information from web pages according to the present invention, necessary information can be extracted, based on the URL (uniform resource locator) of a home page, from a plurality of web pages distributed under the same domain name.

The above and other advantages of the present invention will become clearer through the following detailed description of preferred embodiments of the present invention with reference to the drawings.

In order to provide an understanding of the above and other advantages and features of the present disclosure, specific embodiments of the present disclosure are described in detail below with reference to the drawings. The drawings, together with the following detailed description, are included in and form a part of this specification. Elements having the same function and structure are denoted by the same reference numerals. These drawings merely illustrate typical examples of the present disclosure and do not limit its scope.
FIG. 1A is a diagram showing an example of web pages having a similar structure. FIG. 1B is a diagram showing an example of information extraction from a web page having a single structure. FIG. 2A is a diagram showing an example of information extraction from multiple pages. FIG. 2B is a diagram showing an example of the overall flow of the method of the present invention. FIG. 3 is a flowchart showing the flow of a method for extracting information from a web page according to an embodiment of the present invention. FIG. 4A is a diagram showing an example of the HTML structure and DOM tree structure corresponding to a navigation bar node. FIG. 4B is a diagram showing an example of information extraction. FIG. 5 is a block diagram showing an example of an apparatus for extracting information from a web page according to an embodiment of the present invention. FIG. 6 is a block diagram showing an exemplary configuration of a general-purpose personal computer capable of implementing the method and/or apparatus according to an embodiment of the present invention.

Hereinafter, exemplary embodiments of the present invention are described in detail with reference to the drawings. For convenience of explanation, not all features of an actual implementation are described in the specification. In an actual implementation, a particular embodiment may be modified to achieve the developer's specific goals, for example according to system-related and business-related constraints. Moreover, although such development work may be very complex and time-consuming, it is merely routine work for those skilled in the art having the benefit of this disclosure.

To keep the description of the present invention clear, the drawings show only the apparatus components and/or processing steps closely related to the present invention, and details unrelated to the present invention are omitted.

As mentioned above, collecting and extracting information from the Internet is an important means of building a knowledge base. Although the prior art shown in FIGS. 1A and 1B can meet some requirements, it still has limitations.

As shown in FIG. 2A, http://owtware.com is the URL of a company's home page. Company information such as products, partner companies, and contact details is distributed across different pages, and the parts of the three pages where the main information resides do not share a similar structure.

If only the URL of the home page is known, conventional methods cannot extract information that is distributed across multiple pages in this way. On the other hand, the URL of a home page is usually easy to obtain. Therefore, how to extract other information by expanding from the URL of a home page remains a problem to be solved.

To solve the problems existing in the prior art, the present invention provides a multi-page information extraction method that, when only the home page URL is known, can (1) automatically expand to other pages containing related information, (2) obtain from each related page the position containing the main information, and (3) perform separate information extraction for pages having different attribute types.

FIG. 2B is a diagram showing an example of the overall flow of the method of the present invention. As shown in FIG. 2B, the method according to the present invention mainly includes the following three parts.

(1) The home page is expanded to obtain a set of multiple pages.

(2) A statistical method is used to perform statistical classification on the set of web pages and obtain the navigation bar node. The text of the leaf nodes contained in the navigation bar node is then matched against a keyword dictionary, and the pages to be extracted are obtained based on the information of the matching nodes.

(3) Extraction is performed using different analyzers depending on the information type of the page to be extracted.

Hereinafter, a method for extracting information from a web page according to an embodiment of the present invention is described in detail with reference to FIGS. 3, 4A, and 4B.

FIG. 3 is a flowchart showing the flow of a method for extracting information from a web page according to an embodiment of the present invention.

First, in step 301, a tree structure is generated for each page, among the web page and all of its extended web pages, that contains the domain name of the web page. Specifically, in this embodiment, taking the URL shown in FIG. 2A as an example, the company home page URL is u_root = http://www.owtware.com/, and the information to be extracted is other attributes of the company, such as products and contact details.

First, a crawler is used to crawl the HTML page p_root corresponding to u_root, and the page is parsed to obtain the set of all URLs it contains, u = [u_0, u_1, u_2, ..., u_n]. Because the URLs contained in a page may or may not be related to the company (they may be advertisements or external links, for example), a subset of URLs u' = [u'_0, u'_1, u'_2, ..., u'_n] is selected according to a specific rule: each u'_i contains domain(u_root), where domain(URL) is the operation that extracts the top-level domain name of a URL, for example domain(u_root) = www.owtware.com. In this way, all URLs having the same domain name, for example http://www.owtware.com/index.php/zh/products/, are retained.

Preferably, since the page p_i corresponding to u'_i may itself contain further URLs, p_i may be expanded further. For each p_i, URLs and their corresponding pages are expanded by the same rule, and identical URLs and pages are merged after each expansion. The expansion process may be repeated n times. Usually n = 2 may be used so that a reasonable number of pages is obtained without the number becoming too large. This yields a set of pages having the same domain name, p = [<p_0, u_0>, <p_1, u_1>, <p_2, u_2>, ..., <p_n, u_n>], where p_i represents a web page and u_i represents the URL corresponding to that web page.
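The expansion described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the names `LinkCollector`, `domain`, `expand`, `fetch`, and `rounds` are hypothetical, `fetch` is a caller-supplied function that returns the HTML for a URL, and `rounds` is assumed to count every expansion pass including the first one from the home page. The text calls domain(URL) the top-level domain name; since its example value is www.owtware.com, the sketch compares host names.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain(url):
    """Assumed stand-in for domain(URL): the lower-cased host name."""
    return urlparse(url).hostname or ""


def expand(u_root, fetch, rounds=2):
    """Return {url: html} for same-domain pages reachable from u_root.

    fetch: hypothetical callable url -> HTML string. A real crawler would
    also need error handling, rate limiting, and so on.
    """
    root_domain = domain(u_root)
    pages = {u_root: fetch(u_root)}
    frontier = [u_root]
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            collector = LinkCollector()
            collector.feed(pages[url])
            for href in collector.hrefs:
                absolute, _fragment = urldefrag(urljoin(url, href))
                # Keep only URLs on the root domain; identical URLs are merged
                # by storing each URL once in the dictionary.
                if domain(absolute) == root_domain and absolute not in pages:
                    pages[absolute] = fetch(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return pages
```

Passing the fetch function in as a parameter keeps the sketch testable against an in-memory dictionary of pages instead of a live site.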

Next, in step 302, the navigation bar node in the tree structure is determined. Specifically, in this embodiment, the navigation bar node is obtained from the set p. As mentioned above, the purpose is to obtain from the set p the pages that contain the company's information, such as products and contact details. Usually, the pages corresponding to this information can be reached through the links in the navigation bar node. There are three main reasons for choosing the navigation bar node as the information anchor.

(1) The information is accurate. The pages pointed to by the links contained in the navigation bar node can be regarded as an introduction to the company. For example, the page corresponding to "Products and Services" introduces the company's products, and "Contact" links to a page with information such as the company's address and telephone number. Links that appear in other parts of a web page do not necessarily describe the company's information and may instead be introductions to other companies, advertisements, and the like.

(2) The information is comprehensive. The navigation bar node basically covers all information related to the company. Once the navigation bar node is obtained, all pages containing related information can be obtained, which is very useful for the subsequent information extraction.

(3) It is relatively easy to obtain. Different web pages may have different structures, but the form of the navigation bar node is almost always the same. This commonality makes it possible to locate the navigation bar node accurately within the web page structure.

A method for determining the navigation bar node is described below by way of example.

Given the above three characteristics, frequently appearing nodes may be obtained by counting the nodes in each page p_i (p_i ∈ p). Since these nodes include the navigation bar node, the navigation bar node may be obtained by sorting the frequently appearing nodes based on feature values. The specific method is as follows.

As shown in FIG. 4A, each page p_i in the set p is first converted into a DOM tree structure.

For each leaf node node_i in the DOM tree, the path pattern path_i of node_i is obtained. path_i consists of the text corresponding to the leaf node and the path up to its n-th ancestor node. Practical experience shows that, for most pages, n may be an integer of 5 or more. For example, for the navigation bar node "Contact", n = 5 gives path_i = "ul_li_ul_li_a_Contact".
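As a rough illustration of the path pattern just described, the following sketch derives one pattern per text leaf using only the Python standard library. It rests on assumptions that the text does not fix: the leaf is taken to be the text itself, the n tags are its nearest enclosing elements written outermost first, and the names `LeafPathCollector` and `leaf_path_patterns` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text content is not page text.
SKIPPED_TAGS = {"script", "style"}


class LeafPathCollector(HTMLParser):
    """Emit one path pattern for every non-empty text leaf."""

    def __init__(self, n=5):
        super().__init__()
        self.n = n
        self.stack = []      # currently open tags, outermost first
        self.patterns = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in SKIPPED_TAGS:
            tags = self.stack[-self.n:]
            self.patterns.append("_".join(tags + [text]))


def leaf_path_patterns(html, n=5):
    """Return the path patterns of all text leaves in an HTML string."""
    collector = LeafPathCollector(n)
    collector.feed(html)
    collector.close()
    return collector.patterns
```

For a navigation fragment such as `<div><ul><li><ul><li><a href="/c">Contact</a></li></ul></li></ul></div>`, this yields the single pattern `ul_li_ul_li_a_Contact`, matching the shape of the example in the text.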

Next, the document frequency df_i of each path_i, that is, the number of different documents in which path_i appears, is calculated. From these statistics, a path frequency dictionary node_pattern_dictionary {<path_1, df_1>, ..., <path_n, df_n>} may be obtained, where df_i > t and t is a threshold set as follows.

To reduce the influence of the number of pages |p| on the final result, the threshold t is set in stages.
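The document frequency count can be sketched as below. The function name `build_node_pattern_dictionary` and its parameters are hypothetical. Because the staged rule for the threshold t is not reproduced in this text, the sketch simply takes t as an argument and makes no assumption about how it depends on |p|.

```python
from collections import Counter


def build_node_pattern_dictionary(pages_patterns, t):
    """Map each path pattern to its document frequency, keeping df > t.

    pages_patterns: one iterable of path patterns per page.
    t: threshold chosen by the caller (its staged rule is not shown here).
    """
    df = Counter()
    for patterns in pages_patterns:
        # A pattern counts once per page, however often it repeats there.
        df.update(set(patterns))
    return {path: count for path, count in df.items() if count > t}
```

Counting a set per page, rather than the raw list, is what makes the value a document frequency instead of a total occurrence count.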

After the path frequency dictionary is obtained, a second traversal is performed over the DOM tree structure corresponding to each p_i in the set p. This time, for each non-leaf node node_i, let the set of all non-NULL leaf nodes covered by it be c = [c_0, c_1, c_2, ..., c_n]. For each c_i, if path_i(c_i) exists in the path frequency dictionary node_pattern_dictionary, the information of node_i is recorded. Finally, a candidate dictionary candidate_pattern_dictionary {<path_1, [df_1, cn_1]>, ..., <path_n, [df_n, cn_n]>} may be obtained, where path_i represents the path information from the non-leaf node node_i to its ancestor node, df_i represents the document frequency, and cn_i represents the number of all non-NULL leaf nodes covered by node_i. Unlike the path_i of a leaf node, the path_i of a non-leaf node contains no text information. As indicated by 3 in FIG. 4A, the corresponding path from "Contact" to the ul node is ul_li_ul_div_div, with n = 5.

Finally, the candidate dictionary candidate_pattern_dictionary is sorted by the value of (cn * df / |p|). The path corresponding to the maximum value is taken as the navigation bar node path template, and the ancestor node on that path may be determined to be the navigation bar node. For a given HTML page under the company's home page, the template may be used to locate the navigation bar node.
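The ranking step can be illustrated with a short sketch. It assumes the candidate dictionary is held as a Python dict mapping each path to a `[df, cn]` pair, as in the notation above. The function names and the sample values in the usage note are hypothetical.

```python
def rank_candidates(candidate_pattern_dictionary, page_count):
    """Return (path, score) pairs sorted by cn * df / |p|, highest first."""
    scored = [
        (path, cn * df / page_count)
        for path, (df, cn) in candidate_pattern_dictionary.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def pick_navigation_template(candidate_pattern_dictionary, page_count):
    """Return the path with the highest score, or None if there is none."""
    ranking = rank_candidates(candidate_pattern_dictionary, page_count)
    return ranking[0][0] if ranking else None
```

With made-up values `{"ul_li_ul_div_div": [10, 8], "div_div_body": [10, 2], "p_div_div": [3, 20]}` and 10 pages, the scores are 8.0, 2.0, and 6.0, so `ul_li_ul_div_div` would be selected: a path that appears on every page and covers many leaves outranks one that covers more leaves but appears on few pages.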

Determining the navigation bar node with the above statistical method is merely one example of how the navigation bar node may be determined. The present invention is not limited to this, and other suitable methods may be used to determine the navigation bar node.

Next, in step 303, a leaf node that is covered by the navigation bar node and matches one or more keywords is determined. Specifically, in this embodiment, after the navigation bar node is obtained in step 302, the text corresponding to each non-NULL leaf node covered by the navigation bar node is matched against the dictionary keyword_dict. The dictionary keyword_dict contains predetermined keywords such as "product introduction" and "contact". If a leaf node matches a keyword, the "href" attribute may be retrieved from the corresponding HTML element. Its value is the URL of the corresponding web page. For example, the HTML element corresponding to the "Contact" node in FIG. 4A contains the following link:

href=http://www.owtware.com/index.php/zh/about/contact-us/
Accordingly, a set of web pages containing related information, p' = [<p'_0, u'_0, t'_0>, <p'_1, u'_1, t'_1>, <p'_2, u'_2, t'_2>, ..., <p'_n, u'_n, t'_n>], may be selected from the set p, where p'_i and u'_i are the same as p_i and u_i defined above, and t'_i represents the type corresponding to the page, for example product, person, or contact. This makes it possible to select a different analyzer for extraction according to the type of page.
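A minimal sketch of the keyword matching and href lookup might look as follows. Several details are assumptions rather than statements of the embodiment: matching is done by case-insensitive substring containment, the page type t' is taken from a keyword-to-type mapping, and the names `AnchorCollector` and `match_navigation_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (text, href) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append(("".join(self._text).strip(), self._href))
            self._href = None


def match_navigation_links(nav_html, base_url, keyword_dict):
    """Return (absolute URL, page type) for links whose text hits a keyword.

    keyword_dict: assumed mapping keyword -> page type, e.g. {"contact": "contact"}.
    """
    collector = AnchorCollector()
    collector.feed(nav_html)
    collector.close()
    matches = []
    for text, href in collector.anchors:
        for keyword, page_type in keyword_dict.items():
            if keyword.lower() in text.lower():
                matches.append((urljoin(base_url, href), page_type))
                break  # the first matching keyword decides the type
    return matches
```

Resolving each href against the page URL with `urljoin` covers navigation bars that use relative links, which the absolute href in the example above does not need.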

For each p'_i, the HTML page first needs to be preprocessed. The purpose of the preprocessing is to extract the main information in the page. This process is common to all pages and does not depend on the web page type t'. The result may be used as the input for the subsequent extraction. As shown in (1) of FIG. 4B, the original HTML page contains a great deal of content, but only the part enclosed by the solid-line frame is needed. All other parts, including elements such as the navigation bar node, side lists, and the Footer tag, must be removed. Otherwise, the extraction is easily affected by noise data.

Taking into account the path frequency dictionary node_pattern_dictionary and the candidate dictionary candidate_pattern_dictionary generated in step 302, the following method may be used to determine the leaf nodes that are covered by the navigation bar node and match one or more keywords.

For a non-leaf node node_i in p'_i, let the set of all non-NULL leaf nodes covered by it be c = [c_0, c_1, c_2, ..., c_n]. If the following three conditions are satisfied at the same time, node_i may be determined to be the target content node containing the leaf nodes that match one or more keywords.

Here, c_i is a non-NULL leaf node covered by node_i, c_j is a non-NULL leaf node covered by node_j, i ≠ j, and text_len(*) represents the length of the text corresponding to a leaf node. In other words, the total text length of all non-NULL leaf nodes covered by node_i is greater than the total text length of all non-NULL leaf nodes covered by any other node node_j.

Once a node node_i that satisfies the above three conditions at the same time is determined, the leaf nodes that match the predetermined keywords are thereby determined.
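Of the three conditions, only the text length comparison is spelled out in words here, so the following sketch covers that condition alone and leaves the other two aside. It assumes each candidate node is represented by the list of texts of its non-NULL leaf nodes. The names `text_len_total` and `node_with_longest_text` are hypothetical.

```python
def text_len_total(leaf_texts):
    """Sum of text_len over the non-empty leaf texts covered by one node."""
    return sum(len(text) for text in leaf_texts if text)


def node_with_longest_text(covered_leaves):
    """Return the node whose covered text is strictly longest, else None.

    covered_leaves: assumed mapping node id -> list of leaf texts.
    """
    if not covered_leaves:
        return None
    totals = {node: text_len_total(texts) for node, texts in covered_leaves.items()}
    best = max(totals, key=totals.get)
    # The condition is a strict inequality against every other node.
    if all(totals[best] > total for node, total in totals.items() if node != best):
        return best
    return None
```

Returning `None` on a tie reflects the strict "greater than" in the text. How ties are actually handled in the embodiment is not stated.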

Finally, in step 304, information is extracted from the page corresponding to the matching leaf node. Specifically, in this embodiment, after the node node_i satisfying the above three conditions is determined, the information contained in the leaf nodes covered by that node may be extracted.

Preferably, each of its leaf nodes may be treated as an independent attribute extraction space. As shown in (2) and (3) of FIG. 4B, each node <div class="panel-grid-cell" ...> is treated as an independent attribute space. This fixes the boundary of the attribute values: each value comes only from within a single section {{...}}. For example, when extracting person information, the information within one section {{...}} can be regarded as describing the same person, and the information in different sections {{...}} as describing different people, so extraction errors can be avoided.

Preferably, after the extraction range is determined, a different analyzer, for example a named entity recognizer (NER), a proper noun recognizer, or a numerical value recognizer, may be selected according to the type t'_i of p'_i to extract the specific information. (3) of FIG. 4B shows an example of the result of the proper noun recognizer.
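The idea of choosing an analyzer by page type and running it on each section independently can be sketched as below. The two recognizers are deliberately simple regular-expression stand-ins, not the recognizers of the embodiment, and the names `ANALYZERS`, `numeric_recognizer`, `proper_noun_recognizer`, and `extract` are hypothetical, as is the mapping from page types to analyzers.

```python
import re


def numeric_recognizer(text):
    """Toy numerical recognizer: phone-like runs of digits, spaces, hyphens."""
    return {"numbers": re.findall(r"\+?\d[\d\- ]{5,}\d", text)}


def proper_noun_recognizer(text):
    """Toy proper noun recognizer: two or more capitalized words in a row."""
    return {"names": re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text)}


# Assumed mapping from page type t' to analyzer.
ANALYZERS = {
    "contact": numeric_recognizer,
    "person": proper_noun_recognizer,
}


def extract(page_type, sections):
    """Run the analyzer for page_type on each section separately.

    sections: list of section texts, each treated as its own attribute space
    so that values from different sections are never mixed.
    """
    analyzer = ANALYZERS[page_type]
    return [analyzer(section) for section in sections]
```

Because each section yields its own record, a name found in one section is never attached to a value found in another, which is the boundary property described above.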

Although the above describes extracting related information starting from a company home page, the present invention is not limited to this and may be extended, as necessary, to the extraction of any information from any web page.

The above method may be implemented entirely by a computer-executable program, or may be implemented partially or entirely using hardware and/or firmware. When it is implemented by hardware and/or firmware, or when a computer-executable program is loaded onto a hardware device capable of executing the program, the apparatus for extracting information from a web page described below is realized. In the following, an overview of this apparatus is given without repeating the details already described. Although the apparatus can execute the above method, the method is not limited to using the components of the apparatus described below or to being executed by those components.

FIG. 5 is a block diagram showing an example of an apparatus 500 for extracting information from a web page according to an embodiment of the present invention. The apparatus 500 includes a tree structure generation unit 501, a navigation bar node determination unit 502, a match node determination unit 503, and an information extraction unit 504. The tree structure generation unit 501 generates a tree structure for each page, among the web page and all of its extended web pages, that contains the domain name of the web page. The navigation bar node determination unit 502 determines the navigation bar node in the tree structure. The match node determination unit 503 determines a leaf node that is covered by the navigation bar node and matches one or more keywords. The information extraction unit 504 extracts information from the page corresponding to the matching leaf node.

The apparatus 500 for extracting information from a web page shown in FIG. 5 corresponds to the method shown in FIG. 3. The details of the apparatus 500 have therefore already been given in the description of the method of FIG. 3 and are not repeated here.

The above processing and apparatus may be implemented by software and/or firmware. When they are implemented by software and/or firmware, a program constituting the software for carrying out the above method may be installed from a storage medium or a network onto a computer having a dedicated hardware configuration (for example, the general-purpose personal computer 600 shown in FIG. 6). With various programs installed, the computer can execute various functions.

FIG. 6 is a block diagram showing an exemplary configuration of a general-purpose personal computer capable of implementing the method and/or apparatus according to an embodiment of the present invention. In FIG. 6, a central processing unit (CPU) 601 executes various kinds of processing according to programs stored in a read-only memory (ROM) 602 or programs loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 also stores, as necessary, the data required for the CPU 601 to execute the various kinds of processing. The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.

An input unit 606 (including a keyboard, a mouse, and the like), an output unit 607 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), the storage unit 608 (including, for example, a hard disk), and a communication unit 609 (including a network interface card such as a LAN card, a modem, and the like) are connected to the input/output interface 605. The communication unit 609 performs communication processing via a network such as the Internet. A drive 610 may also be connected to the input/output interface 605 as necessary. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted in the drive 610 as necessary, and a computer program read from it is installed in the storage unit 608 as necessary.

When the above processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.

Note that such a storage medium is not limited to the removable medium 611 shown in FIG. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 611 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)), and semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage unit 608, or the like, which stores the program and is provided to the user together with the device that contains it.

The present invention further provides a computer program product in which the corresponding computer program code, that is, machine-readable instruction code, is stored. When read and executed by a machine, the instruction code can carry out the method according to the embodiments of the present invention described above.

Accordingly, the present invention also encompasses a storage medium on which a program product containing machine-readable instruction code is recorded. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, and memory sticks.

Regarding the embodiments including the examples described above, the following appendices are further disclosed.
(Appendix 1)
A method for extracting information from a web page, the method comprising:
generating a tree structure for each page that contains the domain name of the web page, among the web page and all of its extended web pages;
determining a navigation bar node in the tree structure;
determining leaf nodes that are covered by the navigation bar node and match one or more keywords; and
extracting information from the pages corresponding to the matching leaf nodes.
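The keyword-matching step of appendix 1 can be illustrated with a minimal sketch. It assumes that each leaf node under the navigation bar has already been reduced to a (link text, link target) pair and that matching is a case-insensitive substring test; the function name `match_nav_links`, the pair representation, and the sample keywords are hypothetical and are not specified in this disclosure.

```python
from typing import Iterable, List, Tuple

NavLink = Tuple[str, str]  # (leaf text, link target); an assumed representation


def match_nav_links(nav_links: Iterable[NavLink], keywords: Iterable[str]) -> List[NavLink]:
    """Return the navigation-bar leaves whose text contains any keyword.

    Case-insensitive substring matching is an assumption made for this sketch.
    """
    lowered = [k.lower() for k in keywords]
    return [
        (text, target)
        for text, target in nav_links
        if any(k in text.lower() for k in lowered)
    ]


nav = [("Home", "/"), ("Products", "/products"), ("About Us", "/about")]
matched = match_nav_links(nav, ["product", "about"])
```

The pages reached through the matched link targets are the ones from which information would then be extracted.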
(Appendix 2)
The method according to appendix 1, wherein the navigation bar node is determined using a statistical method.
(Appendix 3)
The method according to appendix 2, wherein the step of determining a navigation bar node in the tree structure comprises:
determining non-leaf nodes that include only leaf nodes whose number of occurrences in the tree structure is greater than a predetermined threshold; and
sorting the non-leaf nodes to determine the navigation bar node.
(Appendix 4)
The method according to appendix 3, wherein determining whether the number of occurrences of a leaf node is greater than the predetermined threshold comprises:
determining whether the number of occurrences, in the tree structure, of the text and path information of the leaf node is greater than the predetermined threshold.
(Appendix 5)
The method according to appendix 4, wherein the path information is the path from the leaf node to its n-th ancestor node, n being a positive integer.
(Appendix 6)
The method according to appendix 5, wherein n is 5 or more.
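As an illustration of appendices 3 to 6, the following sketch counts how often each leaf's (text, path information) pair occurs across the page trees and then collects the non-leaf nodes whose covered leaves all exceed a threshold. The `Node` class, the use of tag names joined by "/" as path information, counting every occurrence across all pages, and the helper `make_page` are assumptions made for the example, not details taken from this disclosure.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Signature = Tuple[str, str]  # (leaf text, path information)


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def leaves(node: Node) -> Iterator[Node]:
    """Yield every leaf covered by the given node."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def signature(leaf: Node, n: int = 5) -> Signature:
    """Leaf text plus the tag path from the leaf up to its n-th ancestor."""
    tags = [leaf.tag]
    cur = leaf.parent
    while cur is not None and len(tags) <= n:
        tags.append(cur.tag)
        cur = cur.parent
    return leaf.text, "/".join(tags)


def count_signatures(pages: List[Node], n: int = 5) -> Counter:
    """Count occurrences of each leaf signature over all page trees."""
    counts: Counter = Counter()
    for root in pages:
        counts.update(signature(leaf, n) for leaf in leaves(root))
    return counts


def candidate_nodes(root: Node, counts: Counter, threshold: int, n: int = 5) -> List[Node]:
    """Non-leaf nodes that cover only leaves occurring more than `threshold` times."""
    result, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            if all(counts[signature(leaf, n)] > threshold for leaf in leaves(node)):
                result.append(node)
            stack.extend(node.children)
    return result


def make_page(nav_items: List[str], body_text: str) -> Node:
    """Build a tiny hypothetical page: a shared menu plus page-specific text."""
    root = Node("html")
    body = root.add(Node("body"))
    nav = body.add(Node("ul"))
    for item in nav_items:
        nav.add(Node("li")).add(Node("a", item))
    body.add(Node("div")).add(Node("p", body_text))
    return root
```

With three such pages sharing the same menu and a threshold of 2, the menu's `ul` node and its `li` children qualify as candidates, while `body` and `div` do not because they cover page-specific text.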
(Appendix 7)
The method according to appendix 3, wherein the step of sorting the non-leaf nodes to determine the navigation bar node comprises:
calculating a feature value of each non-leaf node, the feature value being determined by the number of leaf nodes covered by the non-leaf node and the number of occurrences; and
determining the non-leaf node having the largest feature value among the non-leaf nodes as the navigation bar node.
(Appendix 8)
The method according to appendix 7, wherein the feature value is the ratio of the product of the number of leaf nodes covered by the non-leaf node and the number of occurrences to the total number of pages containing the domain name of the web page.
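The feature value of appendix 8 can be written out as a short sketch. It assumes that each candidate non-leaf node has already been summarized by the number of leaf nodes it covers and by a single occurrence count; how one count is derived for a node covering several leaves is not fixed here and is treated as an input. The `Candidate` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    path: str            # identifies the non-leaf node (assumed representation)
    covered_leaves: int  # number of leaf nodes covered by the node
    occurrences: int     # occurrence count associated with the node


def feature_value(c: Candidate, total_pages: int) -> float:
    """(covered leaves x occurrences) / total number of same-domain pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return c.covered_leaves * c.occurrences / total_pages


def pick_navigation_bar(candidates: Iterable[Candidate], total_pages: int) -> Candidate:
    """Select the candidate with the largest feature value."""
    return max(candidates, key=lambda c: feature_value(c, total_pages))


menu = Candidate("html/body/ul", covered_leaves=6, occurrences=10)
footer = Candidate("html/body/footer", covered_leaves=3, occurrences=10)
best = pick_navigation_bar([menu, footer], total_pages=10)
```

In this example the menu scores 6.0 and the footer 3.0, so the menu is selected: a node that covers more frequently repeated leaves ranks higher.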
(Appendix 9)
The method according to any one of appendices 1 to 8, wherein the step of extracting information from the pages corresponding to the matching leaf nodes comprises:
determining a target node included in the page corresponding to the matching leaf node; and
extracting the text of each leaf node covered by the target node.
(Appendix 10)
The method according to appendix 9, wherein the target node is determined by the following conditions:
the number of occurrences, in the tree structure, of the text and path information of each leaf node included in the target node is equal to or less than the predetermined threshold;
the target node is not one of the non-leaf nodes that include only leaf nodes whose number of occurrences in the tree structure is greater than the predetermined threshold; and
the total length of the text of all leaf nodes included in the target node is greater than the total length of the text of the other non-leaf nodes in the tree structure.
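The three conditions of appendix 10 can be sketched as follows, assuming that each non-leaf node of a page is given as a name mapped to the (text, path information) pairs of the leaves it covers, and that occurrence counts are available in a dictionary. For a node with at least one leaf, the first condition already implies the second, so the sketch checks the first and then maximizes total text length. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Leaf = Tuple[str, str]  # (leaf text, path information)


def choose_target_node(
    non_leaf_nodes: Dict[str, List[Leaf]],
    counts: Dict[Leaf, int],
    threshold: int,
) -> Optional[str]:
    """Pick the non-leaf node whose leaves are all infrequent and whose text is longest."""
    best_name: Optional[str] = None
    best_length = -1
    for name, node_leaves in non_leaf_nodes.items():
        if not node_leaves:
            continue
        # Skip nodes containing any frequently repeated leaf (menus, footers, ...).
        if any(counts.get(leaf, 0) > threshold for leaf in node_leaves):
            continue
        total_length = sum(len(text) for text, _ in node_leaves)
        if total_length > best_length:
            best_name, best_length = name, total_length
    return best_name


home = ("Home", "a/li/ul")
spec = ("Weight: 1.5 kg, battery life: 10 hours", "p/div")
note = ("New", "span/div")
nodes = {"ul.menu": [home], "div.main": [spec], "div.badge": [note]}
occurrences = {home: 12, spec: 1, note: 1}
target = choose_target_node(nodes, occurrences, threshold=5)
```

Here `ul.menu` is rejected because its leaf repeats across pages, and `div.main` wins over `div.badge` on text length.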
(Appendix 11)
The method according to appendix 9, wherein the step of extracting the text of each leaf node covered by the target node comprises:
selecting a different analyzer to perform the extraction depending on the type of the page corresponding to the target node.
(Appendix 12)
The method according to appendix 11, wherein each leaf node of the target node is treated as an independent attribute extraction space.
(Appendix 13)
The method according to appendix 11, wherein the analyzer is an entity recognizer, a proper noun recognizer, or a numerical value recognizer.
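A rough sketch of analyzer selection per appendices 11 to 13 is given below. The page type names ("specification", "company"), the dispatch table, and the regular-expression recognizers are all stand-ins invented for illustration; a practical system would more likely use trained recognizers. Each leaf text is analyzed on its own, reflecting the idea of treating each leaf node as an independent attribute extraction space.

```python
import re
from typing import Callable, Dict, List


def numeric_recognizer(text: str) -> List[str]:
    """Very simple stand-in: integers and decimals appearing in the text."""
    return re.findall(r"\d+(?:\.\d+)?", text)


def proper_noun_recognizer(text: str) -> List[str]:
    """Very simple stand-in: runs of capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text)


# Hypothetical mapping from page type to analyzer.
ANALYZERS: Dict[str, Callable[[str], List[str]]] = {
    "specification": numeric_recognizer,
    "company": proper_noun_recognizer,
}


def extract(page_type: str, leaf_texts: List[str]) -> List[List[str]]:
    """Run the analyzer chosen for the page type on each leaf text separately."""
    analyzer = ANALYZERS[page_type]
    return [analyzer(text) for text in leaf_texts]


values = extract("specification", ["Weight 1.5 kg", "No digits here"])
```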
(Appendix 14)
The method according to any one of appendices 1 to 8, wherein the path information of the determined navigation bar node is used to determine the navigation bar nodes in the web page and all of its extended web pages.
(Appendix 15)
The method according to any one of appendices 1 to 8, wherein the pages containing the domain name of the web page, among the web page and all of its extended web pages, are determined by extracting the URL top-level domain name.
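The domain check of appendix 15 could look like the sketch below, which uses the standard-library `urllib.parse.urlsplit` to obtain the host name. Treating a page as belonging to the site when its host equals the seed host or is a subdomain of it, and stripping a leading "www.", are simplifying assumptions; the disclosure does not specify the comparison rule, and a stricter implementation would consult the Public Suffix List. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased host name of a URL, without a leading 'www.' (empty if absent)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def same_domain(url: str, seed_url: str) -> bool:
    """True if `url` is on the seed's host or on one of its subdomains."""
    seed = host_of(seed_url)
    host = host_of(url)
    return bool(seed) and (host == seed or host.endswith("." + seed))


seed = "https://www.example.com/"
links = [
    "https://www.example.com/products",
    "https://shop.example.com/item/1",
    "https://example.org/",
    "/relative/path",
]
kept = [link for link in links if same_domain(link, seed)]
```

Relative links have no host and are therefore dropped by this check; a crawler would normally resolve them against the page URL first.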
(Appendix 16)
The method according to any one of appendices 1 to 8, wherein the tree structure is an HTML document object model (DOM).
(Appendix 17)
The method according to any one of appendices 1 to 8, wherein the keywords are predetermined keywords.
(Appendix 18)
The method according to any one of appendices 1 to 8, wherein the extended web pages are expanded n times to obtain the pages containing the domain name of the web page, n being an integer of 2 or more.
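The n-fold expansion of appendix 18 can be pictured as a breadth-first traversal limited to n rounds. In the sketch below, `get_links` is a hypothetical callback that returns the outgoing links of a page (an in-memory dictionary stands in for real fetching), and pages are kept only if their host equals the seed's host, which is a simplification.

```python
from typing import Callable, Dict, Iterable, List, Set
from urllib.parse import urlsplit


def expand(seed: str, get_links: Callable[[str], Iterable[str]], n: int = 2) -> Set[str]:
    """Collect same-host pages reachable from `seed` within n rounds of link following."""
    seed_host = urlsplit(seed).hostname
    seen: Set[str] = {seed}
    frontier: List[str] = [seed]
    for _ in range(n):
        next_frontier: List[str] = []
        for url in frontier:
            for link in get_links(url):
                if link in seen or urlsplit(link).hostname != seed_host:
                    continue
                seen.add(link)
                next_frontier.append(link)
        frontier = next_frontier
    return seen


# Toy link graph used instead of fetching real pages.
graph: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://other.test/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
}
pages = expand("https://example.com/", lambda u: graph.get(u, []), n=2)
```

With n = 2 the traversal reaches `/a` and `/b` but not `/c`, and the off-site link is ignored.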
(Appendix 19)
An apparatus for extracting information from a web page, the apparatus comprising:
tree structure generating means for generating a tree structure for each page that contains the domain name of the web page, among the web page and all of its extended web pages;
navigation bar node determining means for determining a navigation bar node in the tree structure;
matching node determining means for determining leaf nodes that are covered by the navigation bar node and match one or more keywords; and
information extracting means for extracting information from the pages corresponding to the matching leaf nodes.
(Appendix 20)
A computer-readable storage medium storing a program that, when executed by a processor, causes the processor to perform the steps of:
generating a tree structure for each page that contains the domain name of a web page, among the web page and all of its extended web pages;
determining a navigation bar node in the tree structure;
determining leaf nodes that are covered by the navigation bar node and match one or more keywords; and
extracting information from the pages corresponding to the matching leaf nodes.

Note that the terms "comprise", "include", "have", and any variations thereof are not limited to exclusive inclusion. A process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, as well as elements inherent to the process, method, article, or apparatus. Further, absent additional limitation, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.

Preferred embodiments of the present invention have been described above with reference to the drawings, but the embodiments and examples are illustrative and not restrictive. Those skilled in the art may make various modifications, improvements, and equivalent substitutions to the present invention within the spirit and scope of the claims. Such modifications, improvements, and equivalents fall within the scope of protection of the present invention.

Claims

1. A method for extracting information from a web page, the method comprising performing the steps of:
generating a tree structure for each page that contains the domain name of the web page, among the web page and all of its extended web pages;
determining a navigation bar node in the tree structure;
determining leaf nodes that are covered by the navigation bar node and match one or more keywords; and
extracting information from the pages corresponding to the matching leaf nodes.

2. The method of claim 1, wherein the step of determining a navigation bar node in the tree structure comprises:
determining non-leaf nodes that include only leaf nodes whose number of occurrences in the tree structure is greater than a predetermined threshold; and
sorting the non-leaf nodes to determine the navigation bar node.

3. The method of claim 2, wherein determining whether the number of occurrences of a leaf node is greater than the predetermined threshold comprises:
determining whether the number of occurrences, in the tree structure, of the text and path information of the leaf node is greater than the predetermined threshold.

4. The method of claim 3, wherein the path information is the path from the leaf node to its n-th ancestor node, n being a positive integer.

5. The method of claim 4, wherein n is 5 or more.

6. The method of claim 2, wherein the step of sorting the non-leaf nodes to determine the navigation bar node comprises:
calculating a feature value of each non-leaf node, the feature value being determined by the number of leaf nodes covered by the non-leaf node and the number of occurrences; and
determining the non-leaf node having the largest feature value among the non-leaf nodes as the navigation bar node.

7. The method of any preceding claim, wherein the step of extracting information from the pages corresponding to the matching leaf nodes comprises:
determining a target node included in the page corresponding to the matching leaf node; and
extracting the text of each leaf node covered by the target node.

8. The method of claim 7, wherein the target node is determined by the following conditions:
the number of occurrences, in the tree structure, of the text and path information of each leaf node included in the target node is equal to or less than a predetermined threshold;
the target node is not one of the non-leaf nodes that include only leaf nodes whose number of occurrences in the tree structure is greater than the predetermined threshold; and
the total length of the text of all leaf nodes included in the target node is greater than the total length of the text of the other non-leaf nodes in the tree structure.

9. An apparatus for extracting information from a web page, the apparatus comprising:
tree structure generating means for generating a tree structure for each page that contains the domain name of the web page, among the web page and all of its extended web pages;
navigation bar node determining means for determining a navigation bar node in the tree structure;
matching node determining means for determining leaf nodes that are covered by the navigation bar node and match one or more keywords; and
information extracting means for extracting information from the pages corresponding to the matching leaf nodes.

10. A computer-readable storage medium storing a program that, when executed by a processor, causes the processor to perform the steps of:
generating a tree structure for each page that contains the domain name of a web page, among the web page and all of its extended web pages;
determining a navigation bar node in the tree structure;
determining leaf nodes that are covered by the navigation bar node and match one or more keywords; and
extracting information from the pages corresponding to the matching leaf nodes.