CN101916272B - A Data Source Selection Method for Deep Web Data Integration

Info

Publication number: CN101916272B
Application number: CN2010102501247A
Authority: CN
Inventors: 方巍; 毕硕本; 文学志
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Changshu Guli Technology Venture Service Co ltd
Priority date: 2010-08-10
Filing date: 2010-08-10
Publication date: 2012-04-25
Anticipated expiration: 2030-08-10
Also published as: CN101916272A

Abstract

The invention discloses a data source selection method for deep web data integration. First, deep web data sources that are highly relevant to the user query are selected based on the semantic features of the query interface, in combination with an ontology library. Second, the quality of the data sources is evaluated with a data source quality evaluation model. Finally, a set of data sources that are both highly relevant to the user query and of high quality is obtained according to the quality evaluation. Compared with the prior art, the method improves the accuracy of deep web queries while reducing information redundancy and improving query efficiency.

Description

A Data Source Selection Method for Deep Web Data Integration

Technical Field

The invention relates to a web-based data source selection method, and in particular to a method for selecting deep web data sources that are reached through web query interfaces, for use in integration services over deep web data sources.

Background

With the widespread use of web databases, the Web is rapidly "deepening". A large number of pages on the Internet are generated dynamically from back-end databases. This information cannot be reached through static links; it can only be obtained by filling in a form and submitting a query. Because traditional web crawlers cannot fill in forms, they cannot crawl these pages. As a result, existing search engines cannot index this content, which stays hidden and invisible to users and is called the Deep Web (also known as the Invisible Web or Hidden Web). The Deep Web is the counterpart of the Surface Web. The concept was first proposed by Dr. Jill Ellsworth in 1994 and refers to Web pages whose content is difficult for ordinary search engines to discover. Deep Web information is generally stored in databases and must be accessed by submitting queries through a query interface. Compared with static pages, it is usually larger in volume, more focused in topic, higher in quality, better structured, and faster growing. Studies indicate that the Deep Web holds 500 times as much information as the Surface Web, spread over nearly 450,000 Deep Web sites. Large-scale Deep Web data integration is an effective way to make this information convenient for users.

A large-scale Deep Web integration system consists of five key components: 1) data source discovery (Deep Web Discovery); 2) query interface extraction (Query Interface Extraction); 3) data source selection (Source Selection); 4) query translation (Query Transfer); and 5) result merging (Result Merging).

Deep Web data sources cover many topics, and each topic has many data sources. Sources on the same topic differ greatly in data quality: some are outdated, inaccurate, or inconsistent, while others are promptly updated, accurate, and consistent. They also differ in size and overlap one another; some overlap heavily, and some completely contain other sources. Taking the business and education domains as examples, Complete Planet lists thousands of Web databases; since Complete Planet has collected only about 7% of the Web databases in the whole Deep Web, the real number is far larger (Bergman M.K. The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing, 2002, 7(1): 8912-8914). Kabra G et al. proposed a method that selects the top-k Deep Web data sources whose content is closest to the user's query request (Kabra G, Li CK, Chang KCC. Query routing: Finding Ways in the Maze of the Deep Web. In Proc. of the ICDE, 2005, 64-73). These methods handle only simple attribute relationships in the query interface and query the forms by keyword. They do not consider the semantic relationships among the attributes of the query interface, so the selected data sources have low accuracy and the returned set of data sources is incomplete. As the number of Web databases keeps growing, selecting Deep Web data sources has become a key problem that urgently needs solving.

Summary of the Invention

The purpose of the invention is to address the shortcomings of the prior art by providing an efficient and accurate method for selecting deep web data sources, thereby improving both the efficiency and the accuracy of deep web data source selection.

Data source selection is the process of, given a set of Deep Web data source query interfaces and a user query, selecting either the set of query interfaces whose relevance to the user query exceeds a set threshold, or the query interfaces of the k data sources with the highest relevance values. Its main purpose is to choose databases with high coverage and little overlap, avoiding large amounts of redundant and irrelevant information; users want high-quality query results and also want to compare equivalent results across sources. Most existing data source selection methods compute the relevance between the user query and the query interface directly, by keyword matching. For the following three reasons, queries answered with existing methods are usually inaccurate and highly redundant, and they also surface irrelevant data sources:

First, a single domain contains a large number of accessible Deep Web resources, and accessing many Deep Web sources on the Internet is time-consuming and laborious. Second, data quality varies widely between databases: some are outdated, inaccurate, or inconsistent, while others are promptly updated, accurate, and consistent. Not every Deep Web source can satisfy a particular query; no Deep Web source in a domain can contain all the information in that domain, so none can answer arbitrary queries in it. Third, most Deep Web data sources in a domain differ in size and overlap one another, some heavily and some to the point of completely containing other sources, so they hold redundant information. For a given query, the more Deep Web sources are accessed, the more redundant the returned information becomes, which greatly increases the difficulty of processing it.

Based on the above analysis, the goal of the Deep Web data source selection step is to select a suitable subset from the many Deep Web data sources in a domain, so that fewer Deep Web sources are accessed, the redundancy of the query results is sufficiently small, and the query cost is lower.

To this end, we use the semantic features of the query interface and expand the user query based on a domain ontology, so that the selected set of query interfaces better satisfies the user's query. Specifically, the technical solution of the invention is as follows:

A data source selection method for deep web data integration, characterized in that it comprises the following steps:

Step A: parse the query interface;

Step B: construct an ontology library and use it to convert the query information into ontology information;

Step C: calculate the relevance between the ontology information and each data source, and select the data sources that satisfy a preset condition on that relevance; for a given target query interface object DWI_i and query ontology Q_i, the relevance is calculated according to the following formula:

$R(DWI_{i}, Q_{i}) = \frac{\sum_{i=1}^{m} (DWI_{i} \times Q_{i})}{\sqrt{\sum_{i=1}^{m} (DWI_{i})^{2}} \times \sqrt{\sum_{i=1}^{m} Q_{i}^{2}}},$

where R(DWI_i, Q_i) is the relevance between the query ontology Q_i and the query interface object DWI_i, and m is the number of objects in the query interface.
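The relevance formula has the form of a cosine similarity between two length-m vectors. The following Python function is a minimal sketch of that computation, not the patent's implementation: it assumes the query interface object and the query ontology have already been encoded as numeric weight vectors over the same m objects (the patent does not specify the encoding), and the function name `relevance` and the zero-vector handling are illustrative choices.

```python
import math


def relevance(dwi, q):
    """Cosine-style relevance between two equal-length weight vectors.

    dwi: weights for the m objects of the query interface (assumed encoding)
    q:   weights for the same m objects in the query ontology (assumed encoding)
    """
    if len(dwi) != len(q):
        raise ValueError("both vectors must have the same length m")
    dot = sum(d * x for d, x in zip(dwi, q))
    norm_dwi = math.sqrt(sum(d * d for d in dwi))
    norm_q = math.sqrt(sum(x * x for x in q))
    if norm_dwi == 0 or norm_q == 0:
        # Assumption: an all-zero vector is treated as unrelated.
        return 0.0
    return dot / (norm_dwi * norm_q)
```

With non-negative weights the result lies in [0, 1], so a preset threshold or a top-k cut can be applied to it directly.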

An ontology is a complex model that carries richer semantic and structural information. The ontology library in step B may be an existing public ontology library, or it may be a new library obtained by collecting existing public ontology libraries and extending them; the invention adopts the latter.

The main task of this kind of ontology learning is to analyze the semantic information implied in the relational model and map it to the corresponding parts of the ontology. In addition, query interfaces and data source result pages usually contain rich information such as concepts, instances, and relationships among domain concepts. Query interfaces appear as HTML forms, so when the database schema is unavailable, the semantics of the Web database can be obtained by analyzing the structure and data of the HTML form, and the ontology can be built from them. Based on this analysis, the ontology library of the invention can be constructed through the following steps:

Step B1: obtain the semantics of the query interface by analyzing the HTML form schema structure with the help of an existing ontology library, and construct the classes of the corresponding ontology library;

Step B2: extract concepts and instances from the query interface and the result pages, and extract the hierarchical and functional relationships of the classes in the existing ontology library;

Step B3: from multiple data sources on a given topic, extract the relationships between the ontology classes obtained in step B2, then infer and map the different relationships, and finally merge them into a higher-level domain ontology; for each class in each ontology library, build the keyword set corresponding to that class, which together form the vocabulary layer of the ontology library.

To further improve the accuracy of data source selection, reduce information redundancy, and lower the query cost, the invention additionally introduces a data source quality score on top of the above technical solution. The quality score measures the quality of each data source; the sources with higher scores are selected and the lower-quality ones are discarded, which greatly reduces information redundancy and improves query accuracy. Specifically, the following steps are performed after step C:

Step D: establish a data source quality assessment model and use it to calculate the quality score of each data source obtained in step C;

Step E: select several high-quality data sources according to the quality score and a chosen selection rule, obtaining the final data source set.

In step E, the high-quality data sources may be selected by keeping those whose quality score exceeds a preset threshold. Alternatively, a Top-k selection may be used: the data sources are sorted by quality score in descending order and the first k are selected, where k is the preset number of data sources in the final selection.
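The two selection rules described for step E can be sketched in a few lines of Python. This is only an illustration: the mapping from data source name to quality score, and the function names `select_by_threshold` and `select_top_k`, are assumptions made for the example.

```python
def select_by_threshold(scores, threshold):
    """Keep sources whose quality score is strictly above threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


def select_top_k(scores, k):
    """Sort sources by quality score in descending order and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical quality scores on the 0-100 scale.
example_scores = {"source_a": 90, "source_b": 70, "source_c": 85}
print(select_by_threshold(example_scores, 80))
print(select_top_k(example_scores, 2))
```

The threshold rule returns a variable number of sources, while Top-k fixes the number of sources that later steps have to query.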

The method of the invention first selects the deep web data sources that are highly relevant to the user query, based on the semantic features of the query interface and in combination with the ontology library. It then measures the quality of each data source by its quality score, selects the sources with higher scores, and discards the lower-quality ones, finally obtaining data sources that are both highly relevant to the user query and of high quality. Compared with the prior art, the method improves the accuracy of deep web queries while reducing information redundancy and improving query efficiency.

Brief Description of the Drawings

Fig. 1 is an example of a deep web query interface according to an embodiment of the invention;

Fig. 2 is a flow chart of the method of the invention;

Fig. 3 is an example of the ontology library structure.

Detailed Description

The technical solution of the invention is described in detail below with reference to the drawings.

As shown in Fig. 2, the invention selects deep web data sources according to the following steps:

Step A: parse the query interface;

As shown in Fig. 1, a query interface contains form controls for entering query information, such as text boxes (Textbox), radio buttons (Radio Button), check boxes (Check box), and drop-down lists (Selection List). Each control is usually associated with a label, which is a piece of descriptive text, and each control can have one or more values; for example, a drop-down list offers a list of values for the user to choose from, while radio buttons and check boxes usually have one value. Logically, a control and its associated label form an attribute, which corresponds to a field in the Deep Web back-end database. Usually, an attribute consists of one label and one or more form controls. Parsing the current Deep Web query interface page yields the labels and form controls for each attribute, which are then grouped by semantic relationship into attributes (each a logical unit of a query condition). The query interface ontology instance DWI can be expressed abstractly as DWI = (S, P, M). Here S holds instance-specific information such as the function of the interface instance, including basic information such as the name of the interface instance (the form label name) and the URL of the interface site; P = {p₁, p₂, …, p_n} is the ontology instance template corresponding to the interface instance; and M denotes the methods provided by the interface instance. Once a DWI instance has been built, the user can issue an ontology-instance-oriented query to retrieve the required information.
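As a rough illustration of the parsing in step A, the sketch below pairs each form control with its label using Python's standard `html.parser` module. It covers only the extraction of attributes (label plus control); the S and M parts of DWI are not modeled. The class and function names, and the assumption that labels are linked to controls through `for`/`id`, are choices made for this example rather than details given in the patent.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Attribute:
    """One logical query condition: a label plus its form control."""
    label: str
    control: str   # e.g. "text", "radio", "checkbox", "select"
    name: str      # internal name of the control
    values: list = field(default_factory=list)


class FormAttributeParser(HTMLParser):
    """Collects controls and the text of <label for=...> elements."""

    def __init__(self):
        super().__init__()
        self.labels = {}         # control id -> label text
        self.controls = {}       # control id -> Attribute (label filled in later)
        self._label_for = None   # id targeted by the <label> currently open
        self._select_id = None   # id of the <select> currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self._label_for = a.get("for")
        elif tag == "input" and "id" in a:
            values = [a["value"]] if a.get("value") else []
            self.controls[a["id"]] = Attribute(
                "", a.get("type", "text"), a.get("name", ""), values)
        elif tag == "select" and "id" in a:
            self._select_id = a["id"]
            self.controls[a["id"]] = Attribute("", "select", a.get("name", ""))
        elif tag == "option" and self._select_id is not None and a.get("value"):
            self.controls[self._select_id].values.append(a["value"])

    def handle_endtag(self, tag):
        if tag == "label":
            self._label_for = None
        elif tag == "select":
            self._select_id = None

    def handle_data(self, data):
        if self._label_for is not None and data.strip():
            self.labels[self._label_for] = data.strip()


def parse_query_interface(html):
    """Return the attributes found in an HTML query form, in document order."""
    parser = FormAttributeParser()
    parser.feed(html)
    for control_id, attribute in parser.controls.items():
        attribute.label = parser.labels.get(control_id, "")
    return list(parser.controls.values())
```

Real query interfaces often place labels in table cells or plain text near the control without a `for` link, so a practical extractor would need layout heuristics that this sketch leaves out.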

The set of Deep Web data source interfaces can be abstracted as follows. Suppose the set of Deep Web data source interfaces in a domain is DWS = {S_i1, S_i2, …, S_im}. Each data source interface S_ii corresponds to a data source ontology template made up of the instances R_i that appear on the query interface, and the union of all instances in the ontology templates is the data source interface set DWS. An instance specifies, for one element on the query interface, the label name, the internal attribute name, one or more modifiers, and their value ranges; it is the smallest semantic unit on the query interface.

Step B: construct an ontology library and use it to convert the query information into ontology information; the ontology library is constructed according to the following steps:

Step B3: from multiple data sources on a given topic, extract the relationships between the ontology classes obtained in step B2, then infer and map the different relationships, and finally merge them into a higher-level domain ontology; for each class in each ontology library, build the keyword set corresponding to that class, which together form the vocabulary layer of the ontology library;

The method abstracts the query information into a query model: a Deep Web source is represented as a relational table DB consisting of a set of query interface attributes Aq = {aq₁, aq₂, …, aq_n} (the interface schema) and a set of query result attributes Ar = {ar₁, ar₂, …, ar_m} (the result schema). Each attribute aq_i ∈ Aq is a query attribute obtained from the query interface, and each attribute ar_j ∈ Ar is an attribute of the query results. Every query operation can be expressed as an SQL-like statement: "SELECT ar₁, ar₂, …, ar_m FROM DB WHERE aq₁ = valq₁, aq₂ = valq₂, …, aq_n = valq_n", where valq_i is the attribute value filled into the query form.
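To make the query model concrete, the following Python sketch renders a filled-in form as the SQL-like statement described above. It is an illustration only: the function name `build_query` and the quoting of values are assumptions, and the conditions are joined with commas simply to mirror the notation used in the text, so the output is a descriptive string rather than executable SQL.

```python
def build_query(result_attrs, conditions, table="DB"):
    """Render a query in the SQL-like notation of the query model.

    result_attrs: names of the result attributes (ar_1 ... ar_m)
    conditions:   mapping from query attribute (aq_i) to filled-in value (valq_i)
    """
    select_part = ", ".join(result_attrs)
    statement = f"SELECT {select_part} FROM {table}"
    if conditions:
        where_part = ", ".join(
            f"{attr} = '{value}'" for attr, value in conditions.items())
        statement += f" WHERE {where_part}"
    return statement


# Hypothetical attribute names for a used-car search form.
print(build_query(["title", "price"], {"make": "Ford", "year": "2009"}))
```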

Expanding the query information through the ontology library yields a series of query interface sets. The ontology structure is shown in Fig. 3, which depicts part of an ontology library whose core concept is a vehicle (Vehicle). The ontology library structure contains a series of abstractions of real-world things. For example, concepts such as "Vehicle", "Car", and "Truck" form the classes of the ontology library; the figure also shows relationships between classes, such as "driver" and "price"; and the library contains the corresponding entities of each class, such as BMW and F512M. Through expansion with the ontology library, a single concept can be expanded into a set of concepts in the ontology layer. For example, the concept "Vehicle" also covers concepts such as "Car" and "Truck".
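The concept expansion described above can be sketched as a traversal of the class hierarchy. The example below is a simplified illustration: the dictionary `SUBCLASSES` is a hypothetical, hand-written fragment standing in for the ontology library of Fig. 3, and `expand_concept` is an assumed name.

```python
from collections import deque

# Hypothetical fragment of a class hierarchy: class -> direct subclasses.
SUBCLASSES = {
    "Vehicle": ["Car", "Truck"],
    "Car": [],
    "Truck": [],
}


def expand_concept(concept, subclasses):
    """Return the concept together with all classes reachable below it."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in subclasses.get(current, []):
            if child not in seen:   # guards against cycles in the hierarchy
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(expand_concept("Vehicle", SUBCLASSES)))
```

A query for "Vehicle" expanded this way can match interfaces that only mention "Car" or "Truck", which plain keyword matching would miss.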

Analysis shows that the main factors affecting the assessed quality of a Deep Web data source are the browser, the Web database, the user, and network performance. This embodiment treats these four kinds of factors as first-level quality factors. Each first-level quality factor in turn contains several second-level quality factors; for example, the Web database factor includes domain integrity, consistency, redundancy, data source size, and others. This yields a quality factor set with two levels, from which the following data source quality assessment model is obtained:

$Q_{s} = \sum_{n=1}^{K} \left( W_{n} \times \sum_{j=1}^{L} w_{j} q_{nj} \right)$

where Q_s ∈ [0, 100] is the quality score of the s-th data source; W_n is the weight of the n-th first-level quality factor in the quality factor set, n = 1, 2, …, K, with K the number of first-level quality factors in the set; w_j is the weight of the j-th second-level quality factor within the n-th first-level quality factor; q_nj is the quality score of the s-th data source as assessed by the j-th second-level quality factor of the n-th first-level quality factor, j = 1, 2, …, L; and L is the number of second-level quality factors contained in the n-th first-level quality factor.

The above data source quality assessment model is prior art; for more detail see (Xian Xuefeng, Fang Wei, et al. A Deep Web data source quality assessment model. Microelectronics & Computer, 2008, 25(10): 47-50).
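The two-level weighted sum of the quality assessment model can be sketched as follows. This is an illustrative reading rather than the cited model's implementation: it assumes the second-level weights are stored per first-level factor (one list for each n), that each group of weights sums to 1, and that every second-level score lies in [0, 100], which together keep Q_s within [0, 100]. The function name `quality_score` and the sample numbers are made up for the example.

```python
def quality_score(first_level_weights, second_level_weights, scores):
    """Two-level weighted quality score of one data source.

    first_level_weights:  [W_1, ..., W_K]
    second_level_weights: one list of w_j per first-level factor
    scores:               one list of q_nj per first-level factor
    """
    total = 0.0
    for W_n, w_n, q_n in zip(first_level_weights, second_level_weights, scores):
        if len(w_n) != len(q_n):
            raise ValueError("each factor needs one weight per second-level score")
        # Inner sum over the second-level factors of this first-level factor.
        total += W_n * sum(w_j * q_nj for w_j, q_nj in zip(w_n, q_n))
    return total


# Hypothetical example with two first-level factors.
W = [0.5, 0.5]
w = [[0.5, 0.5], [1.0]]
q = [[80, 60], [90]]
print(quality_score(W, w, q))
```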

In this step, this embodiment uses the Top-k selection method: the data sources are sorted by quality score in descending order and the first k are selected, where k is the preset number of data sources in the final selection.

Claims

1. A data source selection method for deep web data integration, characterized in that it comprises the following steps:

Step A: parse the query interface;

Step B: construct an ontology library and use it to convert the query information into ontology information; the ontology library is constructed according to the following steps:

Step B1: obtain the semantics of the query interface by analyzing the HTML form schema structure with the help of an existing ontology library, and construct the classes of the corresponding ontology library;

Step B2: extract concepts and instances from the query interface and the result pages, and extract the hierarchical and functional relationships of the classes in the existing ontology library;

Step B3: from multiple data sources on a given topic, extract the relationships between the ontology classes obtained in step B2, then infer and map the different relationships, and finally merge them into a higher-level domain ontology; for each class in each ontology library, build the keyword set corresponding to that class, which together form the vocabulary layer of the ontology library;

Step C: calculate the relevance between the ontology information and each data source, and select the data sources that satisfy a preset condition on that relevance; for a given target query interface object DWI_i and query ontology Q_i, the relevance is calculated according to the following formula:

$R(DWI_{i}, Q_{i}) = \frac{\sum_{i=1}^{m} (DWI_{i} \times Q_{i})}{\sqrt{\sum_{i=1}^{m} (DWI_{i})^{2}} \times \sqrt{\sum_{i=1}^{m} Q_{i}^{2}}},$

where R(DWI_i, Q_i) is the relevance between the query ontology Q_i and the query interface object DWI_i, and m is the number of objects in the query interface;

Step D: establish a data source quality assessment model and use it to calculate the quality score of each data source obtained in step C;

Step E: select several high-quality data sources according to the quality score and a chosen selection rule, obtaining the final data source set.

2. The data source selection method for deep web data integration according to claim 1, characterized in that selecting several high-quality data sources according to the quality score in step E means: sorting the data sources by quality score in descending order and selecting the top k data sources, where k is the preset number of finally selected data sources.