
CN101061478B - Method and system for identifying web document - Google Patents

Info

Publication number
CN101061478B
Authority
CN
China
Prior art keywords
document
search
information
web
web document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005800396934A
Other languages
Chinese (zh)
Other versions
CN101061478A (en)
Inventor
舍拉佳·哈利克
威廉姆·C·布鲁格赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Publication of CN101061478A
Application granted
Publication of CN101061478B
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G06F16/9538 Presentation of query results
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A system identifies a document, performs a search to identify web documents related to an attribute associated with the document, and provides information associated with the web documents together with the document.

Description

Method and system for identifying web documents

Technical field

Systems and methods consistent with the principles of the invention relate generally to information retrieval and, more specifically, to providing information related to a particular document.

Background

Modern computer networks, and the Internet in particular, have made large bodies of information widely and easily available. Free Internet search engines, for instance, index many thousands of web documents that are linked to the Internet. A user connected to the Internet can enter a simple search query to quickly locate web documents relevant to that query.

One category of content that is not widely available on the Internet, however, is the more traditional printed works of authorship, such as books and magazines. One impediment to making such works digitally available is the difficulty of converting printed versions of the works to digital form. Optical character recognition (OCR), which is the act of using an optical scanning device to generate images of text that are subsequently converted to characters in a computer-readable format (e.g., an ASCII file), is a known technique for converting printed text to a usable digital form. OCR systems generally include an optical scanner for generating images of printed pages and software for analyzing the images.

Summary of the invention

According to one aspect, a method may include receiving a search query; performing a first search based on the search query to identify a document; performing a second search based on an attribute associated with the document; and presenting results of the second search.

According to another aspect, a system may include a memory that stores instructions and a processor that executes the instructions. The system may identify a document, perform a search to identify web documents with information related to an attribute associated with the document, and present information associated with the web documents.

According to yet another aspect, a graphical user interface embodied in a computer-readable medium may include a set of links to portions of a document, a description of the contents of the document, and bibliographic information associated with the document. The graphical user interface may also include a link that causes a search to be performed for web documents with information related to an attribute associated with the document.

According to a further aspect, a method may include receiving identification of a document from a user; automatically performing a number of searches to identify web documents related to attributes associated with the document; and providing information associated with the web documents to the user.

According to another aspect, a computer-readable medium may contain computer-executable instructions, including instructions for identifying a document, instructions for performing a search to identify web documents with information related to an attribute associated with the document, instructions for extracting information from the web documents, and instructions for presenting the extracted information along with information associated with the document.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, explain the invention. In the drawings,

FIG. 1 is an exemplary diagram of a network in which systems and methods consistent with the principles of the invention may be implemented;

FIG. 2 is an exemplary diagram of a client or server of FIG. 1 according to an implementation consistent with the principles of the invention;

FIGS. 3A-3D are flowcharts of exemplary processing for presenting information regarding web documents that are related to a document, according to certain implementations consistent with the principles of the invention;

FIG. 4 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented as a search result, according to an implementation consistent with the principles of the invention;

FIG. 5 is an exemplary diagram of a reference page associated with a document, according to an implementation consistent with the principles of the invention;

FIG. 6 is an exemplary diagram of a graphical user interface in which information associated with web documents may be presented, according to an implementation consistent with the principles of the invention;

FIG. 7 is an exemplary diagram of a portion of a reference page, according to another implementation consistent with the principles of the invention;

FIG. 8 is an exemplary diagram of a portion of a reference page, according to yet another implementation consistent with the principles of the invention;

FIG. 9 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented as a search result, according to an alternative implementation consistent with the principles of the invention; and

FIGS. 10A and 10B are exemplary diagrams of graphical user interfaces in which related information may be presented, according to two different implementations consistent with the principles of the invention.

Detailed description

The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention.

Overview

More and more types of documents are becoming searchable through search engines. For example, some documents (such as books, magazines, and/or catalogs) may be scanned and their text recognized via OCR. It is beneficial to learn more about these documents and to make that additional information available to users.

Systems and methods consistent with the principles of the invention may automatically search for additional information related to one or more attributes associated with a document (also called "document attributes") and provide that additional information in association with the document.
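The overall flow (a first search that finds a document, followed by a second search driven by one of that document's attributes) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `two_stage_search`, `search_documents`, `search_web`, and the dictionary layout of a document are all hypothetical names chosen for the example.

```python
def two_stage_search(query, search_documents, search_web, attribute="author"):
    """Run a document search, then a web search on one document attribute.

    search_documents and search_web are caller-supplied callables
    (hypothetical stand-ins for a search engine such as search engine 125).
    """
    documents = search_documents(query)
    if not documents:
        return None, []
    document = documents[0]  # assume the top-ranked document is selected
    related = search_web(str(document[attribute]))
    return document, related


# Stub back ends used only to exercise the sketch.
def fake_document_search(query):
    return [{"title": "Example Book", "author": "A. Author"}]


def fake_web_search(query):
    return [f"web page about {query}"]


doc, related_pages = two_stage_search("military", fake_document_search, fake_web_search)
```

In a real system the user would normally pick the document from the first result list; taking the first hit is only a simplification for the sketch.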

Exemplary network configuration

FIG. 1 is an exemplary diagram of a network 100 in which systems and methods consistent with the principles of the invention may be implemented. Network 100 may include multiple clients 110 connected to multiple servers 120-140 via a network 150. Two clients 110 and three servers 120-140 are illustrated as connected to network 150 for simplicity. In practice, there may be more or fewer clients and servers. Also, in some instances, a client may perform the functions of a server and a server may perform the functions of a client.

Clients 110 may include client entities. An entity may be defined as a device, such as a wireless telephone, a personal computer, a personal digital assistant (PDA), a laptop, or another type of computation or communication device; a thread or process running on one of these devices; and/or an object executable by one of these devices. Servers 120-140 may include server entities that gather, process, search, and/or maintain documents in a manner consistent with the principles of the invention.

In an implementation consistent with the principles of the invention, server 120 may include a search engine 125 usable by clients 110. Server 120 may crawl a corpus of documents (e.g., web documents), index the documents, and store information associated with the documents in a repository of documents. Alternatively or additionally, server 120 may analyze a database (or set of databases) of documents (e.g., books, magazines, newspapers, articles, catalogs, etc.) and store information associated with these documents in the same repository or in a different one. Servers 130 and 140 may store or maintain documents that may be crawled or analyzed by server 120.
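As a rough illustration of the indexing step performed by a server such as server 120, the sketch below builds a simple inverted index. The patent does not specify an index structure, so the inverted index, the `build_index` function, and the two-document corpus are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical two-document corpus standing in for crawled or scanned content.
corpus = {"d1": "military history report", "d2": "cooking report"}
index = build_index(corpus)
```

Looking up `index["report"]` then returns the ids of every document that contains the term, which is the basic operation a search engine needs when answering a query.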

While servers 120-140 are shown as separate entities, one or more of servers 120-140 may perform one or more of the functions of another one or more of servers 120-140. For example, two or more of servers 120-140 may be implemented as a single server. A single one of servers 120-140 may also be implemented as two or more separate (and possibly distributed) devices.

Network 150 may include a local area network (LAN), a wide area network (WAN), a telephone network such as the Public Switched Telephone Network (PSTN), an intranet, the Internet, a memory device, or a combination of networks. Clients 110 and servers 120-140 may connect to network 150 via wired, wireless, and/or optical connections.

As the term is used herein, a "document" is to be broadly interpreted to include any traditional printed work of authorship, such as a book, magazine, catalog, newspaper, article, and so on. As the term is used herein, a "web document" is to be broadly interpreted to include any machine-readable and machine-storable work product that is available via a network, such as network 150. For example, a web document may include a web site, a file, a combination of files, one or more files with embedded links to other files, a newsgroup posting, a blog, a web advertisement, and so on. In the context of the Internet, a common web document is a web page. Web pages often include textual information and may include embedded information (such as meta information, images, hyperlinks, etc.) and/or embedded instructions (such as JavaScript, etc.). As the term is used herein, a "link" is to be broadly interpreted to include any reference to or from a web document.

Exemplary client/server architecture

FIG. 2 is an exemplary diagram of a client or server entity (hereinafter called a "client/server entity"), which may correspond to one or more of clients 110 and servers 120-140, according to an implementation consistent with the principles of the invention. The client/server entity may include a bus 210, a processor 220, a main memory 230, a read only memory (ROM) 240, a storage device 250, an input device 260, an output device 270, and a communication interface 280. Bus 210 may include a path that permits communication among the elements of the client/server entity.

Processor 220 may include a conventional processor, microprocessor, or processing logic that interprets and executes instructions. Main memory 230 may include a random access memory (RAM) or another type of dynamic storage device that may store information and instructions for execution by processor 220. ROM 240 may include a conventional ROM device or another type of static storage device that may store static information and instructions for use by processor 220. Storage device 250 may include a magnetic and/or optical recording medium and its corresponding drive.

Input device 260 may include a conventional mechanism that permits an operator to input information to the client/server entity, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc. Output device 270 may include a conventional mechanism that outputs information to the operator, including a display, a printer, a speaker, etc. Communication interface 280 may include any transceiver-like mechanism that enables the client/server entity to communicate with other devices and/or systems. For example, communication interface 280 may include mechanisms for communicating with another device or system via a network, such as network 150.

As will be described in detail below, the client/server entity, consistent with the principles of the invention, may perform certain search-related operations. The client/server entity may perform these operations in response to processor 220 executing software instructions contained in a computer-readable medium, such as memory 230. A computer-readable medium may be defined as a physical or logical memory device and/or carrier wave.

The software instructions may be read into memory 230 from another computer-readable medium, such as data storage device 250, or from another device via communication interface 280. The software instructions contained in memory 230 may cause processor 220 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the principles of the invention. Thus, implementations consistent with the principles of the invention are not limited to any specific combination of hardware circuitry and software.

Exemplary processing

FIGS. 3A-3D are flowcharts of exemplary processing for presenting information regarding web documents that are related to a document, according to certain implementations consistent with the principles of the invention. Processing may begin with a user providing a search term (or set of search terms) as a search query for searching a repository of documents. In one implementation, the repository includes documents available from the Internet and/or a database (or set of databases), and the vehicle for searching the repository is a search engine, such as search engine 125 (FIG. 1). The user may provide the search query via web browser software on a client, such as client 110 (FIG. 1).

The search query may be received by the search engine and used to identify documents (e.g., books, magazines, newspapers, articles, catalogs, etc.) related to the search query (acts 305 and 310) (FIG. 3A). A number of techniques exist for identifying documents related to a search query. One such technique might include identifying documents that contain the search term or a synonym of the search term. When the search query includes more than one search term, techniques might include identifying documents that contain the search terms as a phrase, documents that contain the search terms but not necessarily together, or documents that contain fewer than all of the search terms. Other techniques are known to those skilled in the art.
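The three multi-term matching strategies mentioned above (terms as a phrase, all terms anywhere, fewer than all terms) can be illustrated with a small sketch. The function name `matches`, its `mode` parameter, and the whitespace tokenization are assumptions for this example; a production search engine would use an index and proper tokenization rather than scanning text.

```python
def matches(text, terms, mode="all"):
    """Check whether text matches the search terms under a given mode.

    mode="phrase": terms must appear consecutively, in order.
    mode="all":    every term must appear, not necessarily together.
    mode="any":    at least one term must appear.
    """
    words = text.lower().split()
    terms = [t.lower() for t in terms]
    if mode == "phrase":
        n = len(terms)
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    if mode == "all":
        return all(t in words for t in terms)
    if mode == "any":
        return any(t in words for t in terms)
    raise ValueError(f"unknown mode: {mode}")


query_terms = ["military", "report"]
phrase_hit = matches("the military report was long", query_terms, "phrase")
scattered_hit = matches("report on the military", query_terms, "all")
partial_hit = matches("the annual report", query_terms, "any")
```

Synonym matching, which the text also mentions, could be layered on by expanding `terms` before calling `matches`; that step is omitted here.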

Optionally, the documents may be scored in some manner (act 315). For example, the score for a document may be based on an information retrieval (IR) score. Several techniques exist for generating an IR score. For example, an IR score for a document may be generated based on the number of occurrences of the search terms in the document, on where in the document the search terms occur (e.g., title, body, footer, header, etc.), or on characteristics of the occurrences (e.g., font, size, color, etc.). Other techniques are known to those skilled in the art.
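One simple way to combine occurrence count with occurrence location is a weighted term count per field, sketched below. The weights in `FIELD_WEIGHTS` and the function `ir_score` are assumptions for illustration only; the patent does not give a formula, and real IR scoring functions are considerably more elaborate.

```python
# Assumed weights: a hit in the title counts more than a hit in the footer.
FIELD_WEIGHTS = {"title": 3.0, "header": 1.5, "body": 1.0, "footer": 0.5}


def ir_score(fields, term):
    """Sum the occurrences of term in each field, weighted by field."""
    term = term.lower()
    score = 0.0
    for name, text in fields.items():
        count = text.lower().split().count(term)
        score += FIELD_WEIGHTS.get(name, 1.0) * count
    return score


example_fields = {
    "title": "Military Report",
    "body": "the military and the military",
}
example_score = ir_score(example_fields, "military")  # 3.0 * 1 + 1.0 * 2
```

Occurrence characteristics such as font or size could be handled the same way, by attaching a weight to each styled occurrence instead of each field.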

Search results may be formed based on the documents and their optional scores and presented to the user (act 320). In one implementation, the search results may include information associated with the documents, such as links to the documents, optionally sorted based on the scores of the documents. The search results may be provided as an HTML document, similar to search results provided by conventional search engines. Alternatively, the search results may be provided in another format agreed upon by the search engine and the client (e.g., Extensible Markup Language (XML)).
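As a sketch of the XML alternative, the snippet below sorts results by score and serializes them with Python's standard `xml.etree.ElementTree` module. The element names (`results`, `result`, `title`, `link`), the `results_to_xml` function, and the example URLs are hypothetical; the source only says that the format is agreed upon by the search engine and the client.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize search results to XML, highest score first."""
    root = ET.Element("results")
    ordered = sorted(results, key=lambda r: r["score"], reverse=True)
    for r in ordered:
        item = ET.SubElement(root, "result", score=str(r["score"]))
        ET.SubElement(item, "title").text = r["title"]
        ET.SubElement(item, "link").text = r["link"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical results; the URLs are placeholders.
sample_results = [
    {"title": "A", "link": "http://example.com/a", "score": 1.5},
    {"title": "B", "link": "http://example.com/b", "score": 4.0},
]
xml_text = results_to_xml(sample_results)
```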

FIG. 4 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented as a search result 400, according to an implementation consistent with the principles of the invention. As shown in FIG. 4, search result 400 may include a document title 410, author information 420, an excerpt 430 from the document, and optionally a link 440 to other relevant excerpts in the document. Assume for this example, and those that follow, that the user has performed a search for documents related to the search term "military" and that one of the resulting documents is the "9/11 Report."

Document title 410 may include the title associated with the document. Selection of document title 410 may cause detailed information associated with the document to be presented, possibly in the form of a reference page (described below). Author information 420 may include the name of the author of the document. Excerpt 430 may include a portion of the document that includes the search term of the search query. Occurrences of the search term may be visually distinguished (e.g., highlighted) within excerpt 430. Link 440 may permit one or more other excerpts from the document that contain the search term to be presented to the user.
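Visually distinguishing the search term inside an excerpt can be done, for example, by wrapping each occurrence in an HTML tag. The sketch below is one possible approach under that assumption; the `highlight` function and the choice of the `<b>` tag are not taken from the source.

```python
import html
import re


def highlight(snippet, terms):
    """Escape the snippet for HTML and wrap whole-word term matches in <b>."""
    safe = html.escape(snippet)
    if not terms:
        return safe
    alternatives = "|".join(re.escape(html.escape(t)) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", safe)


marked = highlight("The Military said <no>", ["military"])
```

Escaping before matching keeps markup characters in the scanned text from being interpreted as HTML when the excerpt is displayed.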

Returning to FIG. 3A, the user may select one of the documents in the search results (act 325). Various well-known techniques may be used for the selection. For example, the selection may occur via a click, a mouse hover, a mouse over, a keystroke, etc. In one implementation, selection of a document may include selection of a link associated with the document, such as selection of document title 410 shown in FIG. 4.

In an implementation consistent with the principles of the invention, detailed information regarding the document, possibly in the form of a reference page, may be presented to the user in response to the user's selection of the document (act 330) (FIG. 3B). FIG. 5 is an exemplary diagram of a reference page 500 associated with a document, according to an implementation consistent with the principles of the invention. As shown in FIG. 5, reference page 500 may include an excerpt 510 from the document, a synopsis 520 regarding the document, a jacket or flap description 530 associated with the document, related information 540, bibliographic information 550, and a set of links 560 to different portions of the document. In other implementations, reference page 500 may include more, fewer, or different types of information.

Excerpt 510 may include a portion of text from the document that may include the search term of the search query. The portion of text may correspond to an image or a text version of the document's text. Occurrences of the search term may be visually distinguished (e.g., highlighted) within the portion of text. Synopsis 520 may include a short description of the contents of the document. Jacket or flap description 530 may include text from a jacket, cover, or flap associated with the document.

Bibliographic information 550 may include information such as an ISBN, an ISSN, the name of the publisher, a category code identifying the category of the document's subject matter, and/or a publication date. In other implementations, bibliographic information 550 may include more, fewer, or different pieces of information. Links 560 may include links to portions of the document. For example, the links may refer to the front cover, table of contents, relevant excerpts, index, and/or back cover of the document. Selection of one of these links may cause an image of the corresponding portion of the document to be presented.
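The parts of reference page 500 map naturally onto a small record type. The sketch below is one hypothetical way to model it; the class name `ReferencePage`, its field names, and all sample values (including the placeholder publisher and paths) are illustrative assumptions rather than anything specified in the source.

```python
from dataclasses import dataclass, field


@dataclass
class ReferencePage:
    """Hypothetical model of reference page 500 and its numbered parts."""
    excerpt: str                      # excerpt 510
    synopsis: str                     # synopsis 520
    jacket_text: str                  # jacket or flap description 530
    bibliographic: dict               # bibliographic information 550
    related_links: dict = field(default_factory=dict)  # related information 540
    part_links: dict = field(default_factory=dict)     # links 560


# Placeholder values only; none of these come from a real record.
page = ReferencePage(
    excerpt="... the military response ...",
    synopsis="A short description of the document.",
    jacket_text="Text from the jacket or flap.",
    bibliographic={"publisher": "Example Press", "category": "History"},
    part_links={"front cover": "/doc/1/front", "index": "/doc/1/index"},
)
```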

Related information 540 may include various types of information related to the document that the user might find useful. The information may be obtained by performing searches related to attributes of the document (e.g., title, author, publisher, publication date, etc.) to identify related web documents.

Examples of information that may be obtained by the searches include information associated with: reviews of the document; the topic of the document; the subject or category of the document; other books in the same series as the document; other issues of the same magazine as the document; articles in the same journal or from the same conference as the document; news articles, blogs, or other types of postings that refer to the document or its author; products related to the document or its topic; the publisher of the document; the publication date associated with the document; the author's biography; web documents related to the author (such as the author's web page); images of the author; and/or other documents by the same author.

In one implementation, related information 540 may include a list of links associated with one or more of the document attributes. As shown in FIG. 5, two exemplary links 542 and 544 are illustrated. In practice, there may be additional links. Selection of one of these links may cause a search related to the particular document attribute to be performed. For example, selection of link 544, which is associated with the author's biography, may cause a search to be performed to identify web documents that include information related to the biography of the document's author. Techniques for forming search queries related to various topics are well known in the art.

相关信息540可任选地还包括与一个或多个文档属性相关的广告集547。例如,广告可以为销售该文档、该文档的一部分、与作者相关的其它文档或与该文档属于同一话题的其它文档而提供。广告集547可还或替换地与其它信息相关或从其它信息得出,所述其它信息例如搜索查询项、另一(例如相关)文档或用户行为(例如搜索或观看历史)。 Related information 540 optionally also includes a set of advertisements 547 related to one or more document attributes. For example, an advertisement may be provided for sale of the document, a portion of the document, other documents related to the author, or other documents on the same topic as the document. Ad set 547 may also or alternatively be related to or derived from other information, such as a search query term, another (eg, related) document, or user behavior (eg, search or viewing history). the

Returning to Figure 3B, it may be determined whether information relating to a document attribute is desired (act 335). For example, it may be determined whether the user has selected a link or an advertisement associated with related information 540. If such information is desired, a search relating to the document attribute may be performed to identify relevant web documents (act 340). For example, if the user desires information about reviews of the document, a search may be performed using a search query that combines a word or words associated with the document's title or the author's name with a word such as "review" or "reviews". Techniques similar to those described above may be used to identify web documents relevant to the search query.
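The query formation described above can be sketched as follows. This is only an illustration: the function name `build_attribute_query`, the quoting of phrases, and the `OR` grouping syntax are assumptions made for the example and are not specified by the description.

```python
def build_attribute_query(title=None, author=None, extra_terms=("review", "reviews")):
    """Combine title/author words with attribute words such as "review".

    Hypothetical helper: the quoting and OR syntax are assumptions about
    what a search engine might accept, not part of the described method.
    """
    parts = []
    if title:
        parts.append(f'"{title}"')
    if author:
        parts.append(f'"{author}"')
    if not parts:
        raise ValueError("a title or an author name is required")
    query = " ".join(parts)
    if extra_terms:
        # Any one of the attribute words is enough to match.
        query += " (" + " OR ".join(extra_terms) + ")"
    return query


if __name__ == "__main__":
    print(build_attribute_query(title="9/11 Report"))
```

A different attribute, such as the author's biography, could be handled by passing other words in `extra_terms`, for example `("biography",)`.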

The web documents may optionally be scored based on an IR score and/or a link-based score. Several techniques exist for generating IR scores and link-based scores. An exemplary technique for generating an IR score may be based on the number of occurrences of the search terms in the document. A technique for generating a link-based score is described in U.S. Patent No. 6,285,999. Other techniques are well known to those skilled in the art.
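As a rough sketch of the scoring described above, the example below counts occurrences of the search terms to obtain an IR score and combines it with a link-based score. The link-based score is treated as a precomputed input rather than computed here, and the weighted sum in `overall_score` is an assumption made for illustration; the description does not say how the two scores are combined.

```python
import re
from collections import Counter


def ir_score(text, query_terms):
    """IR score as the total number of occurrences of the query terms."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sum(counts[term.lower()] for term in query_terms)


def overall_score(ir, link, ir_weight=0.5):
    """Weighted sum of the two scores (the weighting is an assumption)."""
    return ir_weight * ir + (1.0 - ir_weight) * link


if __name__ == "__main__":
    ir = ir_score("The report reviews the Report.", ["report", "reviews"])
    print(ir, overall_score(ir, link=1.0))
```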

In another implementation, searches may be performed, as a background task, for all of the links associated with related information 540. In other words, relevant web documents may be identified for each of the different types of information associated with related information 540, and these web documents may be cached for later presentation to the user when the user indicates a desire for that information.
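One way to picture this background search and caching is the sketch below, which uses Python's standard `concurrent.futures` thread pool. The class name `RelatedInfoCache` and the `search_fn` callable are hypothetical; `search_fn` stands in for whatever search backend is actually used.

```python
from concurrent.futures import ThreadPoolExecutor


class RelatedInfoCache:
    """Runs one search per related-information link in the background."""

    def __init__(self, search_fn, max_workers=4):
        self._search_fn = search_fn  # hypothetical: query string -> results
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def prefetch(self, queries_by_link):
        """Start a background search for every link not yet requested."""
        for link_id, query in queries_by_link.items():
            if link_id not in self._futures:
                self._futures[link_id] = self._executor.submit(self._search_fn, query)

    def get(self, link_id):
        """Return cached results, waiting only if the search is still running."""
        return self._futures[link_id].result()

    def close(self):
        self._executor.shutdown(wait=True)


if __name__ == "__main__":
    cache = RelatedInfoCache(lambda q: [q.upper()])
    cache.prefetch({"reviews": "9/11 report review"})
    print(cache.get("reviews"))
    cache.close()
```

Holding futures rather than finished results lets a user's selection be served from the cache as soon as the corresponding search completes.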

Search results may be formed based on the web documents and their optional scores, and presented to the user (act 345). In one implementation, the search results may include information associated with the web documents, such as links to the web documents, which may optionally be sorted based on the scores of the web documents. The search results may be provided as an HTML document, similar to search results provided by conventional search engines. Alternatively, the search results may be provided in a format agreed upon by the search engine and the client (e.g., XML).
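The sorting and XML delivery mentioned above might look like the following sketch. The element names `results`, `result`, and `title`, and the dictionary keys `url`, `title`, and `score`, are invented for the example; the description only says that the format is agreed upon by the search engine and the client.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Sort results by score (highest first) and serialize them as XML."""
    ranked = sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)
    root = ET.Element("results")
    for r in ranked:
        item = ET.SubElement(root, "result", url=r["url"])
        ET.SubElement(item, "title").text = r["title"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(results_to_xml([
        {"url": "http://a.example/", "title": "A", "score": 0.2},
        {"url": "http://b.example/", "title": "B", "score": 0.9},
    ]))
```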

Figure 6 is an exemplary diagram of a graphical user interface in which related information may be presented, according to an implementation consistent with the principles of the invention. In this exemplary implementation, assume that the user requests additional information relating to reviews of the document by selecting the corresponding link associated with related information 540. In this case, a search may be performed to identify web documents containing reviews of the document. For example, a search query that combines a word or words associated with the document's title (e.g., "9/11 Report") or the author's name with a word such as "review" or "reviews" (or other search terms likely to identify web documents containing reviews of the document) may be used to identify relevant web documents.

A set of search results, two examples of which are illustrated in Figure 6, may be presented to the user. In Figure 6, the exemplary search results correspond to book reviews of the "9/11 Report". For example, search result 600 may include a web document identifier 610, an excerpt 620 from the web document, and other information 630 associated with the web document. Identifier 610 may identify the web document, and selecting identifier 610 may cause the web document to be presented. Excerpt 620 may include a portion of the web document that contains search terms of the search query. Occurrences of the search terms may be visually distinguished (e.g., highlighted) in excerpt 620. Other information 630 may include the address of the web document, the size of the web document, a date associated with the web document, or other information associated with the web document.
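The excerpt with highlighted search terms can be illustrated with a small sketch. The function name `make_excerpt`, the fixed character window, and the `<b>` markers are assumptions made for the example; the description does not prescribe how an excerpt is chosen or how terms are highlighted.

```python
import re


def make_excerpt(text, terms, width=40, mark=("<b>", "</b>")):
    """Return a window of text around the first term hit, with hits marked."""
    if not terms:
        return text[: 2 * width]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        # No search term occurs; fall back to the start of the document.
        return text[: 2 * width]
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    window = text[start:end]
    return pattern.sub(lambda m: mark[0] + m.group(0) + mark[1], window)


if __name__ == "__main__":
    print(make_excerpt("A review of the report.", ["review"], width=5))
```

A production version that emits HTML would also need to escape the document text before inserting markup.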

In another implementation consistent with the principles of the invention, a search may be performed in response to the user selecting a document from the search results (act 325) (Figure 3A). In this implementation, a search relating to one or more document attributes may be performed to identify relevant web documents (act 350) (Figure 3C). For example, search queries may be formed for different document attributes, and searches may be performed to identify relevant web documents. Techniques similar to those described above may be used to identify, and possibly score, the web documents relevant to the search queries.

Detailed information about the document (including information about the relevant web documents) may be presented to the user, possibly in the form of a reference page (act 355). In one implementation, the reference page may be similar to reference page 500 described above with regard to Figure 5. In this implementation, however, the links associated with related information 540 may be replaced or supplemented with information associated with the relevant web documents.

Figure 7 is an exemplary diagram of a portion 700 of a reference page according to another implementation consistent with the principles of the invention. In this implementation, a set of search results relating to one or more document attributes may be presented to the user. As shown in Figure 7, two exemplary search results are presented for document reviews 542. As also shown in Figure 7, a link to additional search results may be provided.

For example, search result 710 may include a web document source 712, an excerpt 714 from the web document, and other information 716 associated with the web document. Source 712 may identify the source of the web document, and selecting source 712 may cause the corresponding web document to be presented. Excerpt 714 may include a portion of the web document that contains search terms of the search query. Occurrences of the search terms may be visually distinguished (e.g., highlighted) in excerpt 714. Other information 716 may include the address of the web document, the size of the web document, a date associated with the web document, or other information associated with the web document.

In another implementation consistent with the principles of the invention, a search may be performed in response to the user selecting a document from the search results (act 325) (Figure 3A). In this implementation, a search relating to one or more document attributes may be performed to identify relevant web documents (act 360) (Figure 3D). For example, search queries may be formed for different document attributes, and searches may be performed to identify relevant web documents. Techniques similar to those described above may be used to identify, and possibly score, the web documents relevant to the search queries.

Information may be extracted from the relevant web documents (act 365). A page, possibly in the form of a reference page, may be created based on the extracted information, and the page may be presented to the user (acts 370 and 375). In one implementation, the reference page may be similar to reference page 500 described above with regard to Figure 5. In this implementation, however, the links associated with related information 540 may be replaced or supplemented with information extracted from the relevant web documents.

Figure 8 is an exemplary diagram of a portion 800 of a reference page according to yet another implementation consistent with the principles of the invention. In this implementation, for each type of related information 540, information may be extracted from the web documents corresponding to a set of search results and presented to the user. The particular types of information extracted from the search results may include any information that the user might find useful.

As shown in Figure 8, information extracted from two exemplary search results is presented for document reviews. For example, information 810 may include an information source 812, an optional user rating 814, a review 816, and other information 818. Source 812 may identify the source of the information (e.g., Amazon.com), and selecting source 812 may cause the web document from that source to be presented. User rating 814 may include a rating of the document by users of source 812 (e.g., Amazon.com). Review 816 may include a review of the document (or a portion of a review) provided by source 812 (e.g., Amazon.com). Other information 818 may include the address of the web document, the size of the web document, a date associated with the web document, or other information associated with the web document.
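The extracted information described for Figure 8 can be modeled as a small record, sketched below. The class name `ExtractedReview`, its field names, and the five-point rating scale are assumptions made for illustration; the description only lists a source, an optional user rating, a review, and other information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedReview:
    source: str                     # e.g. the site the review came from
    review: str                     # the review text, or a portion of it
    url: str                        # address of the web document
    rating: Optional[float] = None  # optional user rating (assumed 0 to 5)

    def summary(self, max_chars=80):
        """One-line rendering for display on a reference page."""
        text = self.review
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        rating = f" ({self.rating:.1f}/5)" if self.rating is not None else ""
        return f"{self.source}{rating}: {text}"


if __name__ == "__main__":
    item = ExtractedReview("Amazon.com", "A thorough account.", "http://www.example.com/r1", 4.5)
    print(item.summary())
```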

Alternative graphical user interface

In an alternative implementation consistent with the principles of the invention, information about a document may be presented in a manner similar to that described above with regard to Figure 4. In this case, however, an additional link to related information, similar to related information 540 (Figure 5), may be provided. Figure 9 is an exemplary diagram of a graphical user interface in which information associated with a document may be presented in the form of a search result 900, according to this alternative implementation. As shown in Figure 9, search result 900 may include a document title 410, author information 420, an excerpt 430 from the document, an optional link 440 to other relevant excerpts in the document, and a link 910 to related information. Document title 410, author information 420, excerpt 430, and optional link 440 may be similar to the corresponding items described above with regard to Figure 4.

Selecting link 910 may cause related information to be presented. Figures 10A and 10B are exemplary diagrams of graphical user interfaces in which related information may be presented, according to two different implementations consistent with the principles of the invention. As shown in Figure 10A, selecting link 910 may cause a set of links to be provided, optionally separated according to the different types of document attributes with which they are associated. As described above, selecting one of the links in the set may cause a search to be performed and the results to be presented.
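Separating the links by the type of document attribute they relate to could be done as in the sketch below. The `(attribute type, label)` pair representation and the function name `group_links_by_attribute` are hypothetical choices for the example.

```python
from collections import defaultdict


def group_links_by_attribute(links):
    """Group (attribute_type, label) pairs, keeping first-seen order."""
    groups = defaultdict(list)
    for attribute_type, label in links:
        groups[attribute_type].append(label)
    return dict(groups)


if __name__ == "__main__":
    print(group_links_by_attribute([
        ("document", "Reviews"),
        ("author", "Biography"),
        ("document", "Same series"),
    ]))
```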

As shown in Figure 10B, selecting link 910 may cause searches to be performed for the different types of document attributes and the results to be presented. In one implementation, a set of search results may be provided (similar to Figure 7). As described above, selecting one of these search results may cause the corresponding web document to be presented. In another implementation, information extracted from the web document(s) corresponding to a set of search results may be provided (similar to Figure 8).

Conclusion

Systems and methods consistent with the principles of the invention may search for additional information relating to one or more attributes of a document and provide the additional information in association with the document.

The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

For example, while a series of acts has been described with regard to Figures 3A-3D, the order of the acts may be modified in other implementations consistent with the principles of the invention. Further, acts that do not depend on one another may be performed in parallel.

It has been described that a document is identified as a search result and that web documents relating to the document or its author may be presented. In other implementations, however, the document may be identified in other ways, such as through a directory, a category, or another listing of documents.

Also, exemplary graphical user interfaces have been described with regard to Figures 4-10B. In other implementations consistent with the principles of the invention, the graphical user interfaces may include more, fewer, or different pieces of information.

As described above, it will be apparent to one of ordinary skill in the art that aspects of the invention may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the principles of the invention is not limiting of the invention. Thus, the operation and behavior of the aspects were described without reference to specific software code, it being understood that one of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.

No element, act, or instruction used in the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article "a" is intended to include one or more items. Where only one item is intended, the term "one" or similar language is used. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.

Claims (16)

1. A method for identifying web documents, comprising:
receiving a search query from a client device over a network;
performing a first search based on the search query to identify a set of search results;
presenting the set of search results for display on the client device;
receiving, from the client device, a selection of one of the set of search results;
presenting, for display on the client device, a reference page associated with a scanned document, the scanned document being associated with the selected search result in the set of search results, the reference page including information about the scanned document and a link associated with a search;
receiving, from the client device, a selection of the link;
in response to receiving the selection of the link, performing a second search, based on an attribute associated with the scanned document, to identify web documents; and
presenting results of the second search for display on the client device.

2. The method of claim 1, wherein presenting the results of the second search comprises:
identifying a web document related to the attribute,
extracting information from the web document, and
presenting the extracted information for display on the client device.

3. The method of claim 1, wherein presenting the results of the second search comprises:
generating scores for the web documents based on an information retrieval score and/or a link-based score,
sorting the web documents based on the scores, and
presenting the sorted web documents for display on the client device.

4. The method of claim 3, wherein generating the scores for the web documents comprises:
generating an information retrieval score for the web documents,
generating a link-based score for the web documents, and
generating an overall score for the web documents based on the information retrieval score and the link-based score.

5. The method of claim 1, wherein the reference page further includes at least one of:
a description of the content of the scanned document,
text associated with one of a jacket, a front cover, or an inside flap associated with the scanned document,
bibliographic information associated with the scanned document, or
an advertisement.

6. The method of claim 1, wherein the reference page further includes:
an excerpt from the scanned document, and
a set of links to portions of the scanned document.

7. The method of claim 6, wherein the excerpt includes an image of a portion of text from the scanned document.

8. The method of claim 6, wherein the set of links references at least one of:
a front cover associated with the scanned document,
a table of contents associated with the scanned document,
an index associated with the scanned document, or
a back cover associated with the scanned document.

9. The method of claim 1, wherein presenting the results of the second search comprises:
presenting a reference page associated with one of the web documents, the reference page including a link to a second web document with information related to the attribute.

10. The method of claim 9, wherein the link is generated by performing the second search.

11. The method of claim 9, wherein the reference page further includes at least one of:
a description of the content of the one web document,
text associated with one of a jacket, a front cover, or an inside flap associated with the one web document,
bibliographic information associated with the one web document, or
an advertisement.

12. The method of claim 9, wherein the reference page further includes:
an excerpt from the one web document, and
a set of links to portions of the one web document.

13. The method of claim 11, wherein the advertisement is related to, or derived from, at least one of the search query, the one web document, or user behavior.

14. The method of claim 1, wherein presenting the results of the second search comprises:
presenting a reference page associated with one of the web documents, the reference page containing information extracted from a web document with information related to the attribute.

15. The method of claim 1, wherein the attribute corresponds to at least one of a title, an author, a category, a publisher, or a publication date associated with the scanned document.

16. A system for identifying web documents, comprising:
means for identifying a set of search results,
means for receiving, from a client device, a selection of one of the set of search results,
means for presenting, to the client device, a reference page associated with a scanned document, the scanned document being associated with the selected search result in the set of search results, the reference page including a link for performing a search,
means for receiving, from the client device, a selection of the link,
means for performing, in response to receiving the selection of the link, a search to identify a web document with information related to an attribute associated with the scanned document, and
means for presenting information associated with the web document for display on the client device.
CN2005800396934A 2004-09-30 2005-08-29 Method and system for identifying web document Expired - Fee Related CN101061478B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US10/953,112 2004-09-30
US10/953,112 US8386453B2 (en) 2004-09-30 2004-09-30 Providing search information relating to a document
PCT/US2005/030646 WO2006039025A1 (en) 2004-09-30 2005-08-29 Providing information relating to a document

Publications (2)

Publication Number Publication Date
CN101061478A CN101061478A (en) 2007-10-24
CN101061478B true CN101061478B (en) 2011-06-15

Family

ID=35708608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005800396934A Expired - Fee Related CN101061478B (en) 2004-09-30 2005-08-29 Method and system for identifying web document

Country Status (7)

Country Link
US (2) US8386453B2 (en)
EP (1) EP1797511A1 (en)
JP (2) JP2008515087A (en)
CN (1) CN101061478B (en)
BR (1) BRPI0515950A (en)
CA (1) CA2583042C (en)
WO (1) WO2006039025A1 (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9799060B2 (en) * 2004-04-01 2017-10-24 Google Inc. Content access with handheld document data capture devices
US20100185538A1 (en) * 2004-04-01 2010-07-22 Exbiblio B.V. Content access with handheld document data capture devices
US7606794B2 (en) * 2004-11-11 2009-10-20 Yahoo! Inc. Active Abstracts
US20060101012A1 (en) * 2004-11-11 2006-05-11 Chad Carson Search system presenting active abstracts including linked terms
US7444589B2 (en) * 2004-12-30 2008-10-28 At&T Intellectual Property I, L.P. Automated patent office documentation
US20060149710A1 (en) * 2004-12-30 2006-07-06 Ross Koningstein Associating features with entities, such as categories of web page documents, and/or weighting such features
MX2007013020A (en) * 2005-04-18 2008-03-18 Collage Analytics Llc System and method for efficiently tracking and dating content in very large dynamic document spaces.
JP4789516B2 (en) * 2005-06-14 2011-10-12 キヤノン株式会社 Document conversion apparatus, document conversion method, and storage medium
CA2838153C (en) * 2005-11-15 2016-07-26 Google Inc. Displaying compact and expanded data items
JP2007287134A (en) * 2006-03-20 2007-11-01 Ricoh Co Ltd Information extraction apparatus and information extraction method
US8073830B2 (en) * 2006-03-31 2011-12-06 Google Inc. Expanded text excerpts
US20070274300A1 (en) 2006-05-04 2007-11-29 Microsoft Corporation Hover to call
JP2008021267A (en) * 2006-07-14 2008-01-31 Fuji Xerox Co Ltd Document retrieval system, document retrieval processing method and document retrieval processing program
US20100185651A1 (en) * 2009-01-16 2010-07-22 Google Inc. Retrieving and displaying information from an unstructured electronic document collection
US8452791B2 (en) 2009-01-16 2013-05-28 Google Inc. Adding new instances to a structured presentation
US8412749B2 (en) 2009-01-16 2013-04-02 Google Inc. Populating a structured presentation with new values
JP2012515382A (en) * 2009-01-16 2012-07-05 グーグル・インコーポレーテッド Visualize the structure of the site and enable site navigation for search results or linked pages
EP2387756A4 (en) * 2009-01-16 2013-06-12 Google Inc Retrieving and displaying information from an unstructured electronic document collection
US8977645B2 (en) 2009-01-16 2015-03-10 Google Inc. Accessing a search interface in a structured presentation
US8615707B2 (en) * 2009-01-16 2013-12-24 Google Inc. Adding new attributes to a structured presentation
US8458171B2 (en) * 2009-01-30 2013-06-04 Google Inc. Identifying query aspects
US8812362B2 (en) * 2009-02-20 2014-08-19 Yahoo! Inc. Method and system for quantifying user interactions with web advertisements
FR2953043A1 (en) * 2009-11-23 2011-05-27 Sagem Comm METHOD FOR PROCESSING A DOCUMENT TO BE ASSOCIATED WITH A SERVICE, AND ASSOCIATED SCANNER
US20120027246A1 (en) * 2010-07-29 2012-02-02 Intuit Inc. Technique for collecting income-tax information
US9280601B1 (en) 2012-02-15 2016-03-08 Google Inc. Modifying search results
CN104428769B (en) * 2012-07-13 2018-04-06 索尼公司 The information of text file reader is provided
CN103577436B (en) * 2012-07-27 2017-10-13 阿尔派株式会社 Content search apparatus and content search method
US8965880B2 (en) 2012-10-05 2015-02-24 Google Inc. Transcoding and serving resources
US8924850B1 (en) 2013-11-21 2014-12-30 Google Inc. Speeding up document loading
CN103744856B (en) * 2013-12-03 2016-09-21 北京奇虎科技有限公司 Linkage extended search method and device, system
CN104239570B (en) * 2014-09-30 2018-04-13 百度在线网络技术(北京)有限公司 The searching method and device of paper
CN107396147A (en) * 2017-07-17 2017-11-24 环球智达科技(北京)有限公司 The method for pushing of personage's relevant information
CN107277574A (en) * 2017-07-17 2017-10-20 环球智达科技(北京)有限公司 The method for pushing of film relevant information
CN110008395B (en) * 2018-09-17 2021-11-02 北京字节跳动网络技术有限公司 Comment content presentation method and device, storage medium and terminal
JP7527623B2 (en) 2020-06-05 2024-08-05 株式会社サピエンス Information processing device and program
US12118025B2 (en) * 2020-10-28 2024-10-15 Ihc Invest, Inc. Comprehension engine to comprehend contents of selected documents

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5708825A (en) * 1995-05-26 1998-01-13 Iconovex Corporation Automatic summary page creation and hyperlink generation
US6122647A (en) * 1998-05-19 2000-09-19 Perspecta, Inc. Dynamic generation of contextual links in hypertext documents

Family Cites Families (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5317507A (en) * 1990-11-07 1994-05-31 Gallant Stephen I Method for document retrieval and for word sense disambiguation using neural networks
US5623681A (en) * 1993-11-19 1997-04-22 Waverley Holdings, Inc. Method and apparatus for synchronizing, displaying and manipulating text and image documents
US5799325A (en) * 1993-11-19 1998-08-25 Smartpatents, Inc. System, method, and computer program product for generating equivalent text files
US6112201A (en) * 1995-08-29 2000-08-29 Oracle Corporation Virtual bookshelf
US5713016A (en) * 1995-09-05 1998-01-27 Electronic Data Systems Corporation Process and system for determining relevance
WO1997019415A2 (en) * 1995-11-07 1997-05-29 Cadis, Inc. Search engine for remote object oriented database management system
US5787424A (en) * 1995-11-30 1998-07-28 Electronic Data Systems Corporation Process and system for recursive document retrieval
US5893109A (en) * 1996-03-15 1999-04-06 Inso Providence Corporation Generation of chunks of a long document for an electronic book system
JP3714723B2 (en) 1996-05-13 2005-11-09 沖電気工業株式会社 Document display system
US6285999B1 (en) 1997-01-10 2001-09-04 The Board Of Trustees Of The Leland Stanford Junior University Method for node ranking in a linked database
US7363291B1 (en) * 2002-03-29 2008-04-22 Google Inc. Methods and apparatus for increasing efficiency of electronic document delivery to users
JP3779431B2 (en) * 1997-06-13 2006-05-31 富士通株式会社 Relational database management device, intermediate link table automatic creation processing method, and program storage medium
US5987457A (en) * 1997-11-25 1999-11-16 Acceleration Software International Corporation Query refinement method for searching documents
US6032145A (en) * 1998-04-10 2000-02-29 Requisite Technology, Inc. Method and system for database manipulation
IL126373A (en) * 1998-09-27 2003-06-24 Haim Zvi Melman Apparatus and method for search and retrieval of documents
US7200804B1 (en) * 1998-12-08 2007-04-03 Yodlee.Com, Inc. Method and apparatus for providing automation to an internet navigation application
US6510406B1 (en) * 1999-03-23 2003-01-21 Mathsoft, Inc. Inverse inference engine for high performance web search
JP2000322427A (en) * 1999-05-11 2000-11-24 Kubota Corp Information retrieval device and recording medium
US7287214B1 (en) * 1999-12-10 2007-10-23 Books24X7.Com, Inc. System and method for providing a searchable library of electronic documents to a user
US6785670B1 (en) * 2000-03-16 2004-08-31 International Business Machines Corporation Automatically initiating an internet-based search from within a displayed document
US6968332B1 (en) * 2000-05-25 2005-11-22 Microsoft Corporation Facility for highlighting documents accessed through search or browsing
US7149743B2 (en) * 2000-11-17 2006-12-12 Heck.Com, Llc Virtual directory
US20020103920A1 (en) * 2000-11-21 2002-08-01 Berkun Ken Alan Interpretive stream metadata extraction
US6792459B2 (en) * 2000-12-14 2004-09-14 International Business Machines Corporation Verification of service level agreement contracts in a client server environment
US7158971B1 (en) * 2001-03-07 2007-01-02 Thomas Layne Bascom Method for searching document objects on a network
US6732090B2 (en) * 2001-08-13 2004-05-04 Xerox Corporation Meta-document management system with user definable personalities
JP2003067149A (en) * 2001-08-30 2003-03-07 Canon Inc Data processing device, data processing system, data processing method, tab printing method, storage medium, and program
JP3791908B2 (en) 2002-02-22 2006-06-28 インターナショナル・ビジネス・マシーンズ・コーポレーション Search system, system, search method, and program
US7076497B2 (en) * 2002-10-11 2006-07-11 Emergency24, Inc. Method for providing and exchanging search terms between internet site promoters
US20040138988A1 (en) * 2002-12-20 2004-07-15 Bart Munro Method to facilitate a search of a database utilizing multiple search criteria
JP2004318321A (en) * 2003-04-14 2004-11-11 Nec Corp Biological information retrieval system and its method
US20050004835A1 (en) * 2003-07-01 2005-01-06 Yahoo! Inc System and method of placing a search listing in at least one search result list
US20060143674A1 (en) * 2003-09-19 2006-06-29 Blu Ventures, Llc Methods to adapt search results provided by an integrated network-based media station/search engine based on user lifestyle
US20050222989A1 (en) * 2003-09-30 2005-10-06 Taher Haveliwala Results based personalization of advertisements in a search engine
US20070214126A1 (en) * 2004-01-12 2007-09-13 Otopy, Inc. Enhanced System and Method for Search
US20050160083A1 (en) * 2004-01-16 2005-07-21 Yahoo! Inc. User-specific vertical search
US7359893B2 (en) * 2004-03-31 2008-04-15 Yahoo! Inc. Delivering items based on links to resources associated with search results
US8407094B1 (en) * 2004-03-31 2013-03-26 Google Inc. Providing links to related advertisements
US7698626B2 (en) * 2004-06-30 2010-04-13 Google Inc. Enhanced document browsing with automatically generated links to relevant information
US20060143016A1 (en) * 2004-07-16 2006-06-29 Blu Ventures, Llc And Iomedia Partners, Llc Method to access and use an integrated web site in a mobile environment
US20060036659A1 (en) * 2004-08-12 2006-02-16 Colin Capriati Method of retrieving information using combined context based searching and content merging

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5708825A (en) * 1995-05-26 1998-01-13 Iconovex Corporation Automatic summary page creation and hyperlink generation
US6122647A (en) * 1998-05-19 2000-09-19 Perspecta, Inc. Dynamic generation of contextual links in hypertext documents

Also Published As

Publication number Publication date
EP1797511A1 (en) 2007-06-20
US8386453B2 (en) 2013-02-26
JP2012104149A (en) 2012-05-31
CA2583042C (en) 2014-03-18
WO2006039025A1 (en) 2006-04-13
JP2008515087A (en) 2008-05-08
CN101061478A (en) 2007-10-24
BRPI0515950A (en) 2008-08-12
JP5531033B2 (en) 2014-06-25
US20060074868A1 (en) 2006-04-06
US20130151497A1 (en) 2013-06-13
CA2583042A1 (en) 2006-04-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: California, USA

Patentee after: Google Inc.

Address before: California, USA

Patentee before: GOOGLE Inc.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110615