CN101061478B

CN101061478B - Method and system for identifying web document

Info

Publication number: CN101061478B
Application number: CN2005800396934A
Authority: CN
Inventors: 舍拉佳·哈利克; 威廉姆·C·布鲁格赫
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2004-09-30
Filing date: 2005-08-29
Publication date: 2011-06-15
Anticipated expiration: 2025-08-29
Also published as: EP1797511A1; US8386453B2; JP2012104149A; CA2583042C; WO2006039025A1; JP2008515087A; CN101061478A; BRPI0515950A; JP5531033B2; US20060074868A1; US20130151497A1; CA2583042A1

Abstract

A system that identifies a document, performs a search to identify web documents related to attributes associated with the document, and provides information associated with the web documents and the document.

Description

Method and system for identifying web documents

Technical field

Systems and methods consistent with the principles of the invention relate generally to information retrieval and, more specifically, to providing information related to a particular document.

Background

Modern computer networks, and the Internet in particular, make vast amounts of information widely and easily available. For example, free Internet search engines index many thousands of web documents that are linked to the Internet. A user connected to the Internet can enter a simple search query to quickly locate web documents relevant to that query.

However, one category of content that is not widely available on the Internet is the more traditional printed work of authorship, such as books and magazines. An impediment to making these works available digitally is the difficulty of converting the printed version of a work into digital form. Optical character recognition (OCR), which is the act of using an optical scanning device to generate an image of text that is then converted into characters in a computer-readable format (such as an ASCII file), is a known technique for converting printed text into a usable digital form. An OCR system typically includes an optical scanner that generates an image of a printed page, and software that analyzes the image.

Summary of the invention

According to one aspect, a method may include: receiving a search query; performing a first search, based on the search query, to identify a document; performing a second search based on an attribute associated with the document; and presenting results of the second search.

According to another aspect, a system may include a memory that stores instructions and a processor that executes the instructions. The system may identify a document, perform a search to identify web documents that contain information related to an attribute associated with the document, and present information associated with those web documents.

According to yet another aspect, a graphical user interface embodied in a computer-readable medium may include a set of links to portions of a document, a description of the content of the document, and bibliographic information associated with the document. The graphical user interface may also include a link that causes a search to be performed for web documents that contain information related to an attribute associated with the document.

According to a further aspect, a method may include: receiving an identification of a document from a user; automatically performing multiple searches to identify web documents related to attributes associated with the document; and providing information associated with those web documents to the user.

According to yet another aspect, a computer-readable medium may contain computer-executable instructions, including instructions for identifying a document, instructions for performing a search to identify web documents that contain information related to an attribute associated with the document, instructions for extracting information from those web documents, and instructions for presenting the extracted information together with information associated with the document.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, explain the invention. In the drawings,

FIG. 1 is an exemplary diagram of a network in which systems and methods consistent with the principles of the invention may be implemented;

FIG. 2 is an exemplary diagram of a client or server of FIG. 1 according to an implementation consistent with the principles of the invention;

FIGS. 3A-3D are flowcharts of exemplary processing for presenting information about web documents related to a document, according to certain implementations consistent with the principles of the invention;

FIG. 4 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented in the form of a search result, according to an implementation consistent with the principles of the invention;

FIG. 5 is an exemplary diagram of a reference page associated with a document, according to an implementation consistent with the principles of the invention;

FIG. 6 is an exemplary diagram of a graphical user interface in which information associated with web documents may be presented, according to an implementation consistent with the principles of the invention;

FIG. 7 is an exemplary diagram of a portion of a reference page according to another implementation consistent with the principles of the invention;

FIG. 8 is an exemplary diagram of a portion of a reference page according to yet another implementation consistent with the principles of the invention;

FIG. 9 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented in the form of a search result, according to an alternative implementation consistent with the principles of the invention; and

FIGS. 10A and 10B are exemplary diagrams of graphical user interfaces in which related information may be presented, according to two different implementations consistent with the principles of the invention.

Detailed description

The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention.

Overview

More and more types of documents are becoming searchable via search engines. For example, some documents (such as books, magazines, and/or catalogs) may be scanned and their text recognized via OCR. It would be beneficial to learn more about these documents and to make this additional information available to users.

Systems and methods consistent with the principles of the invention may automatically search for additional information related to one or more attributes associated with a document (also called "document attributes") and provide that additional information in association with the document.

Exemplary network configuration

FIG. 1 is an exemplary diagram of a network 100 in which systems and methods consistent with the principles of the invention may be implemented. Network 100 may include multiple clients 110 connected to multiple servers 120-140 via a network 150. Two clients 110 and three servers 120-140 are shown connected to network 150 for simplicity. In practice, there may be more or fewer clients and servers. Also, in some instances, a client may perform the functions of a server and a server may perform the functions of a client.

Clients 110 may include client entities. An entity may be defined as a device, such as a wireless telephone, a personal computer, a personal digital assistant (PDA), a laptop, or another type of computing or communication device; a thread or process running on one of these devices; and/or an object executable by one of these devices. Servers 120-140 may include server entities that gather, process, search, and/or maintain documents in a manner consistent with the principles of the invention.

In one implementation consistent with the principles of the invention, server 120 may include a search engine 125 usable by clients 110. Server 120 may crawl a corpus of documents (e.g., web documents), index the documents, and store information associated with the documents in a repository of documents. Alternatively or additionally, server 120 may analyze a database (or set of databases) of documents (e.g., books, magazines, newspapers, articles, catalogs, etc.) and store information associated with these documents in the same repository or in a different one. Servers 130 and 140 may store or maintain documents that may be crawled or analyzed by server 120.

Although servers 120-140 are shown as separate entities, one or more of servers 120-140 may perform one or more functions of another one or more of servers 120-140. For example, two or more of servers 120-140 may be implemented as a single server. A single one of servers 120-140 may also be implemented as two or more separate (and possibly distributed) devices.

Network 150 may include a local area network (LAN), a wide area network (WAN), a telephone network such as the Public Switched Telephone Network (PSTN), an intranet, the Internet, a memory device, or a combination of networks. Clients 110 and servers 120-140 may connect to network 150 via wired, wireless, and/or optical connections.

As the term is used herein, "document" is to be broadly interpreted to include any traditional printed work of authorship, such as a book, magazine, catalog, newspaper, article, and so on. As the term is used herein, "web document" is to be broadly interpreted to include any machine-readable and machine-storable work product that is available over a network, such as network 150. For example, a web document may include a web site, a file, a combination of files, one or more files with embedded links to other files, a newsgroup posting, a blog, a web advertisement, and so on. In the context of the Internet, a common web document is a web page. Web pages often include textual information and may include embedded information (e.g., meta information, images, hyperlinks, etc.) and/or embedded instructions (e.g., JavaScript, etc.). As the term is used herein, "link" is to be broadly interpreted to include any reference to or from a web document.

Exemplary client/server architecture

FIG. 2 is an exemplary diagram of a client or server entity (hereinafter called a "client/server entity"), which may correspond to one or more of clients 110 and servers 120-140, according to an implementation consistent with the principles of the invention. The client/server entity may include a bus 210, a processor 220, a main memory 230, a read-only memory (ROM) 240, a storage device 250, an input device 260, an output device 270, and a communication interface 280. Bus 210 may include a path that permits communication among the elements of the client/server entity.

Processor 220 may include a conventional processor, a microprocessor, or processing logic that interprets and executes instructions. Main memory 230 may include a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 220. ROM 240 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 220. Storage device 250 may include a magnetic and/or optical recording medium and its corresponding drive.

Input device 260 may include a conventional mechanism that permits an operator to input information to the client/server entity, such as a keyboard, a mouse, a pen, or voice recognition and/or biometric mechanisms. Output device 270 may include a conventional mechanism that outputs information to the operator, including a display, a printer, a speaker, and so on. Communication interface 280 may include any transceiver-like mechanism that enables the client/server entity to communicate with other devices and/or systems. For example, communication interface 280 may include mechanisms for communicating with another device or system via a network, such as network 150.

As will be described in detail below, the client/server entity, consistent with the principles of the invention, may perform certain searching-related operations. The client/server entity may perform these operations in response to processor 220 executing software instructions contained in a computer-readable medium, such as memory 230. A computer-readable medium may be defined as a physical or logical memory device and/or a carrier wave.

The software instructions may be read into memory 230 from another computer-readable medium, such as data storage device 250, or from another device via communication interface 280. The software instructions contained in memory 230 may cause processor 220 to perform the processes described below. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the principles of the invention. Thus, implementations consistent with the principles of the invention are not limited to any specific combination of hardware circuitry and software.

Exemplary processing

FIGS. 3A-3D are flowcharts of exemplary processing for presenting information about web documents related to a document, according to certain implementations consistent with the principles of the invention. Processing may begin with a user providing a search term (or a set of search terms) as a search query for searching a document repository. In one implementation, the document repository includes documents available from the Internet and/or a database (or set of databases), and the vehicle for searching the repository is a search engine, such as search engine 125 (FIG. 1). The user may provide the search query via web browser software on a client, such as client 110 (FIG. 1).

The search query may be received by the search engine and used to identify documents (e.g., books, magazines, newspapers, articles, catalogs, etc.) related to the search query (acts 305 and 310) (FIG. 3A). A number of techniques exist for identifying documents related to a search query. One such technique might include identifying documents that contain the search term or a synonym of the search term. When the search query includes more than one search term, the technique might include identifying documents that contain the search terms as a phrase, documents that contain all of the search terms but not necessarily together, or documents that contain fewer than all of the search terms. Other techniques are also well known to those skilled in the art.
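The three multi-term matching modes described above (phrase, all terms, any term) can be sketched as follows. This is a minimal illustration only, not the matching technique of any particular search engine; the function names `tokenize`, `matches`, and `find_documents` and the in-memory `docs` dictionary are assumptions made for the example, and synonym matching is omitted.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(doc_text, terms, mode="all"):
    """Return True if doc_text matches the search terms under the given mode.

    mode="phrase": the terms appear consecutively and in order.
    mode="all":    every term appears somewhere, not necessarily together.
    mode="any":    at least one of the terms appears.
    """
    tokens = tokenize(doc_text)
    terms = [t.lower() for t in terms]
    if not terms:
        return False
    if mode == "phrase":
        n = len(terms)
        return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))
    present = [t in tokens for t in terms]
    return all(present) if mode == "all" else any(present)


def find_documents(docs, terms, mode="all"):
    """Return the ids of documents (a dict of id -> text) that match."""
    return [doc_id for doc_id, text in docs.items() if matches(text, terms, mode)]
```

A production system would consult an inverted index rather than scanning every document, but the three modes would be distinguished in the same way.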

Optionally, the documents may be scored in some manner (act 315). For example, the score of a document may be based on an information retrieval (IR) score. Several techniques exist for generating an IR score. For example, the IR score of a document may be generated based on the number of occurrences of the search term in the document, on where in the document text the search term occurs (e.g., title, body, footer, header, etc.), or on characteristics of the occurrences of the search term (e.g., font, size, color, etc.). Other techniques are also well known to those skilled in the art.
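One way to picture an occurrence-based IR score is a weighted count in which hits in some parts of the document count for more than hits in others. The sketch below is illustrative only: the weights in `FIELD_WEIGHTS` and the function name `ir_score` are hypothetical and are not taken from the specification.

```python
# Hypothetical per-field weights: a hit in the title counts for more than
# a hit in the body. These values are assumptions made for illustration.
FIELD_WEIGHTS = {"title": 3.0, "header": 1.5, "body": 1.0, "footer": 0.5}


def ir_score(doc_fields, terms):
    """Score a document as the weighted number of search-term occurrences.

    doc_fields maps a field name ("title", "body", ...) to that field's text.
    Fields without an entry in FIELD_WEIGHTS get a weight of 1.0.
    """
    lowered = [t.lower() for t in terms]
    score = 0.0
    for field, text in doc_fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        words = text.lower().split()
        for term in lowered:
            score += weight * words.count(term)
    return score
```

Occurrence characteristics such as font or size could be folded in the same way, as additional multipliers per occurrence.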

Search results may be formed based on the documents and their optional scores and presented to the user (act 320). In one implementation, the search results may include information associated with the documents, such as links to the documents, optionally sorted based on the document scores. The search results may be provided as an HTML document, similar to search results provided by conventional search engines. Alternatively, the search results may be provided in another format agreed upon by the search engine and the client, such as Extensible Markup Language (XML).
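Sorting results by score and serializing them as either HTML or XML can be sketched with the Python standard library. The result dictionaries with `title`, `url`, and `score` keys, and the element names used in the XML output, are assumptions made for this example; the specification does not define a result schema.

```python
import html
import xml.etree.ElementTree as ET


def rank(results):
    """Sort results by descending score; a missing score counts as 0."""
    return sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)


def to_html(results):
    """Render ranked results as an ordered HTML list of links."""
    items = [
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(r["url"], quote=True), html.escape(r["title"])
        )
        for r in rank(results)
    ]
    return "<ol>" + "".join(items) + "</ol>"


def to_xml(results):
    """Render ranked results as XML for clients that agreed on that format."""
    root = ET.Element("results")
    for r in rank(results):
        item = ET.SubElement(root, "result", url=r["url"])
        item.text = r["title"]
    return ET.tostring(root, encoding="unicode")
```

Both serializers share `rank`, so the ordering seen by the user does not depend on the output format.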

FIG. 4 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented in the form of a search result 400, according to an implementation consistent with the principles of the invention. As shown in FIG. 4, search result 400 may include a document title 410, author information 420, an excerpt 430 from the document, and, optionally, a link 440 to other relevant excerpts in the document. Assume for this example, and those that follow, that the user has performed a search for documents related to the search term "military" and that one of the resulting documents is the "9/11 Report".

Document title 410 may include the title associated with the document. Selection of document title 410 may cause detailed information associated with the document to be presented, possibly in the form of a reference page (described below). Author information 420 may include the name of the author of the document. Excerpt 430 may include a portion of the document that contains a search term of the search query. Occurrences of the search term may be visually distinguished (e.g., highlighted) within excerpt 430. Link 440 may allow one or more other excerpts from the document that contain the search term to be presented to the user.
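Extracting an excerpt around a search term and visually distinguishing the term's occurrences can be sketched as below. The function name `make_snippet`, the `width` parameter, and the choice of `<b>` tags for highlighting are assumptions made for illustration; the specification says only that occurrences may be highlighted.

```python
import html
import re


def make_snippet(text, term, width=40):
    """Return an HTML excerpt around the first occurrence of term.

    Up to `width` characters of context are kept on each side, and every
    case-insensitive occurrence of the term inside the excerpt is wrapped
    in <b> tags. Returns None if the term is empty or does not occur.
    """
    if not term:
        return None
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    window = text[start:end]
    # Splitting on a capturing group keeps the matched text at odd indices.
    parts = re.split("(" + re.escape(term) + ")", window, flags=re.IGNORECASE)
    out = []
    for i, part in enumerate(parts):
        escaped = html.escape(part)
        out.append("<b>" + escaped + "</b>" if i % 2 == 1 else escaped)
    return "".join(out)
```

Escaping each piece separately keeps markup in the source text from leaking into the page while leaving the highlight tags intact.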

Returning to FIG. 3A, the user may select one of the documents in the search results (act 325). Any of a variety of well-known techniques may be used for the selection. For example, a selection may be made by a click, a mouse hover, a mouseover, a keystroke, and so on. In one implementation, selection of a document may include selection of a link associated with the document, such as document title 410 shown in FIG. 4.

In one implementation consistent with the principles of the invention, detailed information about a document, possibly in the form of a reference page, may be presented to the user in response to the user's selection of the document (act 330) (FIG. 3B). FIG. 5 is an exemplary diagram of a reference page 500 associated with a document, according to an implementation consistent with the principles of the invention. As shown in FIG. 5, reference page 500 may include an excerpt 510 from the document, a synopsis 520 of the document, a jacket or flap description 530 associated with the document, related information 540, bibliographic information 550, and a set of links 560 to different portions of the document. In other implementations, reference page 500 may include more, fewer, or different types of information.

Excerpt 510 may include a portion of text from the document that may contain a search term of the search query. The portion of text may correspond to an image or a text version of the document text. Occurrences of the search term may be visually distinguished (e.g., highlighted) within the portion of text. Synopsis 520 may include a short description of the content of the document. Jacket or flap description 530 may include text from a jacket, cover, or flap associated with the document.

Bibliographic information 550 may include information such as an ISBN, an ISSN, the name of the publisher, a category code identifying the subject-matter category of the document, and/or a publication date. In other implementations, bibliographic information 550 may include more, fewer, or different pieces of information. Links 560 may include links to portions of the document. For example, the links may refer to the front cover, table of contents, relevant excerpts, index, and/or back cover of the document. Selection of one of these links may cause an image of the corresponding portion of the document to be presented.

Related information 540 may include various types of information, related to the document, that a user might find useful. This information may be obtained by performing searches related to document attributes (e.g., title, author, publisher, publication date, etc.) to identify related web documents.

Examples of information that may be obtained by searching include: information associated with reviews of the document; information associated with the topic of the document; information associated with the subject or category of the document; information associated with books in the same series as the document; information associated with other issues of the same magazine as the document; information associated with articles in the same journal, or from the same conference, as the document; information associated with news articles, blogs, or other types of postings that reference the document or its author; information associated with products related to the document or its topic; information associated with the publisher of the document; information associated with the publication date of the document; information associated with a biography of the author; information associated with web documents related to the author (such as the author's web page); information associated with images of the author; and/or information associated with other documents by the same author.

In one implementation, related information 540 may include a list of links associated with one or more document attributes. Two exemplary links, 542 and 544, are shown in FIG. 5. In practice, there may be additional links. Selection of one of these links may cause a search related to a particular document attribute to be performed. For example, selection of link 544, associated with the author biography, may cause a search to be performed to identify web documents that contain information related to the biography of the document's author. Techniques for forming search queries related to various topics are well known in the art.

Related information 540 may optionally also include a set of advertisements 547 related to one or more document attributes. For example, advertisements may be provided for the sale of the document, a portion of the document, other documents related to the author, or other documents on the same topic as the document. The set of advertisements 547 may also, or alternatively, be related to or derived from other information, such as search query terms, another (e.g., related) document, or user behavior (e.g., search or viewing history).

Returning to FIG. 3B, it may be determined whether information related to a document attribute is desired (act 335). For example, it may be determined whether the user has selected one of the links or advertisements associated with related information 540. If information related to a document attribute is desired, a search related to that attribute may be performed to identify related web documents (act 340). For example, if the user desires information about reviews of the document, a search may be performed using as the search query, for example, a word or words associated with the title of the document or the name of the author, together with a word like "review" or "reviews". Techniques similar to those described above may be used to identify web documents related to the search query.
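Forming a query from document attributes plus a few type-specific words can be sketched as a small lookup of templates. Everything named here is an assumption for illustration: the `QUERY_TEMPLATES` table, the `build_query` function, the quoted-phrase and `OR` query syntax, and the placeholder author name used in the usage note.

```python
# Hypothetical templates: for each kind of related information, which document
# attributes to quote in the query and which extra words to add.
QUERY_TEMPLATES = {
    "reviews": (["title"], ["review", "reviews"]),
    "author_biography": (["author"], ["biography"]),
    "publisher": (["publisher"], []),
}


def build_query(attributes, info_type):
    """Build a query string for one kind of related information.

    attributes maps attribute names ("title", "author", ...) to values.
    Raises KeyError for an unknown info_type and ValueError if none of the
    attributes needed by the template is available.
    """
    fields, extra = QUERY_TEMPLATES[info_type]
    parts = ['"{}"'.format(attributes[f]) for f in fields if attributes.get(f)]
    if not parts:
        raise ValueError("no usable attribute for " + info_type)
    if len(extra) == 1:
        parts.append(extra[0])
    elif extra:
        parts.append("(" + " OR ".join(extra) + ")")
    return " ".join(parts)
```

For a document with the title "9/11 Report", `build_query(attrs, "reviews")` yields `"9/11 Report" (review OR reviews)`; with a placeholder author such as "Jane Doe", the biography template yields `"Jane Doe" biography`.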

The web documents may optionally be scored based on IR scores and/or link-based scores. Several techniques exist for generating IR scores and link-based scores. An exemplary technique for generating an IR score might be based on the number of occurrences of the search term in the document. A technique for generating a link-based score is described in U.S. Patent No. 6,285,999. Other techniques are also well known to those skilled in the art.

In another implementation, searches may be performed, as a background task, for all of the links associated with related information 540. In other words, related web documents may be identified for each of the different types of information associated with related information 540, and these related web documents may be cached for later presentation to the user when the user indicates a desire for that information.
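Running the searches in the background and caching their results for later presentation can be sketched with a thread pool. The class name `RelatedInfoCache` and the injected `search_fn` callable are assumptions made for this example; the specification does not say how the background task or the cache is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


class RelatedInfoCache:
    """Run related-information searches in the background and cache them.

    search_fn is any callable that takes a query string and returns results;
    it stands in for a real search engine call.
    """

    def __init__(self, search_fn, max_workers=4):
        self._search_fn = search_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def prefetch(self, queries):
        """Start one background search per entry of a name -> query dict."""
        for name, query in queries.items():
            if name not in self._futures:
                self._futures[name] = self._pool.submit(self._search_fn, query)

    def get(self, name):
        """Return the cached results, waiting if the search is still running."""
        return self._futures[name].result()

    def close(self):
        """Wait for outstanding searches and release the worker threads."""
        self._pool.shutdown(wait=True)
```

`prefetch` would be called when the reference page is built, and `get` when the user selects one of the related-information links, at which point the results are usually already available.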

Search results may be formed based on the web documents and their optional scores and presented to the user (act 345). In one implementation, the search results may include information associated with the web documents, such as links to the web documents, optionally sorted based on the web document scores. The search results may be provided as an HTML document, similar to search results provided by conventional search engines. Alternatively, the search results may be provided in another format agreed upon by the search engine and the client, such as XML.

FIG. 6 is an exemplary diagram of a graphical user interface in which related information may be presented, according to an implementation consistent with the principles of the invention. In this exemplary implementation, assume that the user has requested additional information related to reviews of the document by selecting the corresponding link associated with related information 540. In this case, a search may be performed to identify web documents that contain reviews of the document. For example, a search query that combines a word or words associated with the title of the document (e.g., "9/11 Report") or the name of the author with a word like "review" or "reviews" (or other search terms likely to identify web documents containing reviews of the document) may be used to identify related web documents.

A set of search results, two examples of which are illustrated in FIG. 6, may be presented to the user. In FIG. 6, the exemplary search results correspond to book reviews of the 9/11 Report. For example, search result 600 may include a web document identifier 610, an excerpt 620 from the web document, and other information 630 associated with the web document. Identifier 610 may identify the web document. Selection of identifier 610 may cause the web document to be presented. Excerpt 620 may include a portion of the web document that contains search terms of the search query. Occurrences of the search terms may be visually distinguished (e.g., highlighted) in excerpt 620. Other information 630 may include the address of the web document, the size of the web document, a date associated with the web document, or other information associated with the web document.
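One possible way to form such an excerpt and visually distinguish the search terms is sketched below. The function name `make_excerpt`, the window size, and the use of `<b>` tags for highlighting are assumptions made for illustration only; a real implementation would also escape HTML in the source text.

```python
import re


def make_excerpt(text, terms, width=60):
    """Return a window of text around the first matching term, with every
    occurrence of a term inside the window wrapped in <b>...</b>."""
    if not terms:
        return text[:width]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        # No term found: fall back to the start of the document.
        return text[:width]
    start = max(0, match.start() - width // 2)
    end = min(len(text), match.end() + width // 2)
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", text[start:end])
```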

In another implementation consistent with the principles of the invention, the search may be performed in response to the user selecting a document from the search results (act 325) (FIG. 3A). In this implementation, a search relating to one or more attributes of the document may be performed to identify related web documents (act 350) (FIG. 3C). For example, search queries may be formed for the different document attributes, and searches may be performed to identify related web documents. Techniques similar to those described above may be used to identify, and possibly score, web documents that are relevant to a search query.

Detailed information about the document, including information about the related web documents, may be presented to the user, possibly in the form of a reference page (act 355). In one implementation, the reference page may be similar to reference page 500 described above with respect to FIG. 5. In this implementation, however, the links associated with related information 540 may be replaced or supplemented with information associated with the related web documents.

FIG. 7 is an exemplary diagram of a reference page portion 700 according to another implementation consistent with the principles of the invention. In this implementation, a set of search results relating to one or more document attributes may be presented to the user. As shown in FIG. 7, two exemplary search results are presented in connection with document reviews 542. As also shown in FIG. 7, a link to additional search results may be provided.

For example, search result 710 may include a web document source 712, an excerpt 714 from the web document, and other information 716 associated with the web document. Source 712 may identify the source of the web document. Selection of source 712 may cause the corresponding web document to be presented. Excerpt 714 may include a portion of the web document that contains search terms of the search query. Occurrences of the search terms may be visually distinguished (e.g., highlighted) in excerpt 714. Other information 716 may include the address of the web document, the size of the web document, a date associated with the web document, or other information associated with the web document.

In yet another implementation consistent with the principles of the invention, the search may be performed in response to the user selecting a document from the search results (act 325) (FIG. 3A). In this implementation, a search relating to one or more attributes of the document may be performed to identify related web documents (act 360) (FIG. 3D). For example, search queries may be formed for the different document attributes, and searches may be performed to identify related web documents. Techniques similar to those described above may be used to identify, and possibly score, web documents that are relevant to a search query.

Information may be extracted from the related web documents (act 365). A page, possibly in the form of a reference page, may be created based on the extracted information and presented to the user (acts 370 and 375). In one implementation, the reference page may be similar to reference page 500 described above with respect to FIG. 5. In this implementation, however, the links associated with related information 540 may be replaced or supplemented with the information extracted from the related web documents.

FIG. 8 is an exemplary diagram of a reference page portion 800 according to yet another implementation consistent with the principles of the invention. In this implementation, for each type of related information 540, information may be extracted from the web documents corresponding to a set of search results and presented to the user. The particular types of information extracted from the search results may include any information that a user might find useful.

As shown in FIG. 8, information extracted from two exemplary search results relating to reviews of the document is presented. For example, information 810 may include an information source 812, an optional user rating 814, a review 816, and other information 818. Source 812 may identify the source of the information (e.g., Amazon.com). Selection of source 812 may cause the web document from that source to be presented. User rating 814 may include a rating of the document by users of source 812 (e.g., Amazon.com). Review 816 may include a review of the document (or a portion of a review) provided by source 812 (e.g., Amazon.com). Other information 818 may include the address of the web document, the size of the web document, a date associated with the web document, or other information associated with the web document.
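Purely as an illustrative sketch, the extracted information described above might be held in a record such as the following. The class name `ExtractedReview`, its field names, the five-point rating scale, and the truncation length are assumptions and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedReview:
    source: str                      # source of the information (cf. source 812)
    review: str                      # review text or a portion of it (cf. review 816)
    rating: Optional[float] = None   # optional user rating (cf. user rating 814)
    address: str = ""                # address of the web document (cf. other information 818)


def render_entry(info: ExtractedReview, max_chars: int = 80) -> str:
    """Format one extracted entry as a single line for a reference page."""
    text = info.review
    if len(text) > max_chars:
        # Show only a portion of a long review.
        text = text[:max_chars].rstrip() + "..."
    rating = f" [{info.rating:.1f}/5]" if info.rating is not None else ""
    return f"{info.source}{rating}: {text}"
```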

Alternative Graphical User Interface

In an alternative implementation consistent with the principles of the invention, information about a document may be presented in a manner similar to that described above with respect to FIG. 4. In this case, however, an additional link to related information, similar to related information 540 (FIG. 5), may be provided. FIG. 9 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented in the form of a search result 900, according to this alternative implementation. As shown in FIG. 9, search result 900 may include document title 410, author information 420, an excerpt 430 from the document, an optional link 440 to other relevant excerpts in the document, and a link 910 to related information. Document title 410, author information 420, excerpt 430, and optional link 440 may be similar to the corresponding items described above with respect to FIG. 4.

Link 910 may cause related information to be presented. FIGS. 10A and 10B are exemplary diagrams of graphical user interfaces in which related information may be presented, according to two different implementations consistent with the principles of the invention. As shown in FIG. 10A, selection of link 910 may cause a set of links to be provided, which may optionally be grouped according to the different types of document attributes with which they are associated. As described above, selection of one of the links in the set may cause a search to be performed and the results presented.

As shown in FIG. 10B, selection of link 910 may cause searches to be performed with respect to the different types of document attributes, and the results presented. In one implementation, a set of search results may be provided (similar to FIG. 7). As described above, selection of one of these search results may cause the corresponding web document to be presented. In another implementation, information extracted from the web document or documents corresponding to a set of search results may be provided (similar to FIG. 8).

Conclusion

Systems and methods consistent with the principles of the invention may search for additional information relating to one or more attributes of a document and provide the additional information in association with the document.

The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

For example, while a series of acts has been described with respect to FIGS. 3A-3D, the order of the acts may be modified in other implementations consistent with the principles of the invention. Further, acts that do not depend on one another may be performed in parallel.

It has been described that a document is identified as a search result and that web documents related to the document or its author may be presented. In other implementations, however, the document may be identified in other ways, such as through a directory, a category, or another listing of documents.

Also, exemplary graphical user interfaces have been described with respect to FIGS. 4-10B. In other implementations consistent with the principles of the invention, the graphical user interfaces may include more, fewer, or different pieces of information.

It will be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the invention. Thus, the operation and behavior of the aspects are described without reference to specific software code, it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.

No element, act, or instruction used in the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article "a" is intended to include one or more items. Where only one item is intended, the term "one" or similar language is used. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.

Claims

1. A method for identifying a web document comprising:

receiving a search query from a client device over a network;

performing a first search based on the search query to identify a set of search results;

presenting the set of search results for display on the client device;

receiving a selection of one of the set of search results from the client device;

presenting for display on the client device a reference page associated with a scanned document associated with the selected one of the set of search results, the reference page including information about the scanned document and a link associated with a search;

receiving a selection of the link from the client device;

in response to receiving the selection of the link, performing a second search to identify a web document based on attributes associated with the scanned document; and

presenting results of the second search for display on the client device.

2. The method of claim 1, wherein the step of presenting the results of the second search comprises:

identifying the web document associated with the attribute,

extracting information from the web document, and

presenting the extracted information for display on the client device.

3. The method of claim 1, wherein the step of presenting the results of the second search comprises:

generating a score for said web document based on an information retrieval score and/or a link-based score,

sorting the web documents based on the score, and

presenting the sorted web documents for display on the client device.

4. The method of claim 3, wherein the step of generating a score for the web document comprises:

generating an information retrieval score for said web document,

generating a link-based score for said web document, and

generating an overall score for the web document based on the information retrieval score and the link-based score.

5. The method of claim 1, wherein the reference page further comprises at least one of the following:

a description of the content of the scanned document,

text associated with one of the front cover, back cover, or cover flap associated with the scanned document,

bibliographic information associated with the scanned document, or

an advertisement.

6. The method of claim 1, wherein the reference page further comprises:

an excerpt from the scanned document, and

a set of links to various parts of the scanned document.

7. The method of claim 6, wherein the excerpt comprises an image of a portion of text from the scanned document.

8. The method of claim 6, wherein the set of links references at least one of:

a cover page associated with the scanned document,

a table of contents associated with the scanned document,

an index associated with the scanned document, or

a back cover associated with the scanned document.

9. The method of claim 1, wherein the step of presenting the results of the second search comprises:

presenting a reference page associated with one of the web documents, the reference page including a link to a second web document with information related to the attribute.

10. The method of claim 9, wherein the link is generated by performing a second search.

11. The method of claim 9, wherein the reference page further comprises at least one of:

a description of the content of said one web document,

text associated with one of the front cover, back cover, or cover flap associated with said one web document,

bibliographic information associated with said one web document, or

an advertisement.

12. The method of claim 9, wherein said reference page further comprises:

an excerpt from said one web document, and

a set of links to parts of said one web document.

13. The method of claim 11, wherein the advertisement is related to or derived from at least one of a search query, the one web document, or user behavior.

14. The method of claim 1, wherein the step of presenting the results of the second search comprises:

presenting a reference page associated with one of the web documents, the reference page containing information extracted from a web document with information associated with the attribute.

15. The method of claim 1, wherein the attribute corresponds to at least one of a title, author, category, publisher, or publication date associated with the scanned document.

16. A system for identifying web documents comprising:

means for identifying a set of search results,

means for receiving a selection of one of the set of search results from a client device,

means for presenting to said client device a reference page associated with a scanned document associated with said selected one of said set of search results, said reference page including a link,

means for receiving a selection of the link from the client device,

means for performing a search to identify web documents with information related to attributes associated with the scanned document in response to receiving a selection of the link, and

means for presenting information associated with the web document for display on said client device.