[go: up one dir, main page]

HK1163846B - A computer-implemented method and a system and device for performing searches using queries - Google Patents

A computer-implemented method and a system and device for performing searches using queries Download PDF

Info

Publication number
HK1163846B
HK1163846B HK12104074.7A HK12104074A HK1163846B HK 1163846 B HK1163846 B HK 1163846B HK 12104074 A HK12104074 A HK 12104074A HK 1163846 B HK1163846 B HK 1163846B
Authority
HK
Hong Kong
Prior art keywords
anchor text
text strings
format
search
written
Prior art date
Application number
HK12104074.7A
Other languages
Chinese (zh)
Other versions
HK1163846A1 (en
Inventor
维巴休.米塔尔
热.M.蓬特
迈赫兰.萨哈米
桑贾伊.格马瓦特
约翰.A.鲍尔
Original Assignee
Google Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/676,724 external-priority patent/US8706747B2/en
Application filed by Google Llc filed Critical Google Llc
Publication of HK1163846A1 publication Critical patent/HK1163846A1/en
Publication of HK1163846B publication Critical patent/HK1163846B/en

Links

Abstract

The invention provides a computer executing method and systems and devices for searching using queries. The method includes: receiving a first search inquiry of a first format, wherein the first search inquiry includes a first item of the first format; identifying a first set and a second set respectively including a first anchor text string and a second anchor text string, wherein each of first anchor text string and second anchor text string is respectively a displayed part of a first hyperlink and the first hyperlink directs to a first document; obtaining a candidate translation of the first item written in a second format at least partially based on the frequency of each of the unique item in a number of second anchor text strings; and providing the search result in response to the search inquiry of the first format, wherein the search result is identified by a second search inquiry by using a search engine and the second search inquiry includes at least one of candidate translations of the second format.

Description

Computer-implemented method and system and apparatus for performing a search using a query
The application is a divisional application, and the original application has application number of 200480028535.4, application date of 2004, 9 and 13, and is named as a computer-implemented method and a system and a device for executing search by using query.
Technical Field
The present application relates generally to information search and retrieval. More particularly, a system and method for performing a search using a query written in a character set or a language different from the character set or at least some of the documents being searched is disclosed.
Background
Most search engines work under the assumption that end users enter search queries using something similar to a traditional keyboard, where entry of alphanumeric strings is not difficult. However, as small devices become more popular, this concept is not always effective. For example, a user may query a search engine using a wireless telephone that supports the WAP (wireless application protocol) standard. Devices such as wireless telephones often have data entry interfaces in which a particular action (e.g., a key press) by a user may correspond to more than one alphanumeric character. A detailed description of the WAP architecture is available at http:// wwl.wap format.org/tech/documents/SPEC-WAPALch-19980439. pdf ("WAP 100 Wireless application protocol architecture Specification").
In the usual case, WAP users navigate to a search query page and are presented with a form of entry of their search query. With conventional methods, the user may be required to press a plurality of keys to select a particular letter. On a standard telephone keypad, for example, the user may select the letter "b" twice by pressing the "2" key, or the letter "s" four times by pressing the "7" key. Thus, to enter a query for "ben smith", the user typically needs to enter the following key string: 223366077776444844 which map to the following letters:
22→b
33→e
66→n
0 → space
7777→s
6→m
444→i
8→t
44→h
After the user enters their search request, the search engine receives the words from the user and proceeds in much the same way that they received the request from the desktop browser (where the user used a traditional keyboard).
As can be seen from the above example, this form of data entry is inefficient because it requires eighteen keystrokes to enter nine alphanumeric characters (including spaces) corresponding to "ben smith".
Similar difficulties may occur when a query is typed out using a non-target language keyboard. For example, Japanese text may be represented using a number of different character sets including hiragana, katakana, and kanji, but none of these character sets are easily entered using a typical ASCII keyboard based on the Roman alphabet. In such a situation, users often use word processors such as Ichitaro, produced by jusstystem corporation of german City, japan, which can convert text written in romaji (roman alphabet representation of japanese representing speech) into katakana, hiragana, and kanji. Using the word processor, the user can type a query in romaji, and then cut and paste the translated text from the word processor's screen into a search box on the browser. A disadvantage of this approach is that it is relatively slow and tedious and requires the user to access a copy of the word processor, but this may not be feasible due to cost and/or memory limitations.
Accordingly, there remains a need for methods and apparatus that provide relevant search results in response to ambiguous search queries.
Disclosure of Invention
As embodied and broadly described herein, methods and apparatus in accordance with the present invention provide relevant search results in response to an ambiguous search query. According to the invention, such a method comprises receiving a sequence of ambiguous information members from a user. The method obtains mapping information that maps ambiguous information members to less ambiguous information members. This mapping information is used to map the ambiguous information member sequences into one or more corresponding less ambiguous information member sequences. One or more of these less ambiguous sequences of information members are provided as input to a search engine. Search results are obtained from a search engine and presented to a user.
In addition, systems and methods are disclosed for performing a search using a query that is represented in a language or character set that is different from the character set or language of at least some of the documents to be searched. Embodiments of the present invention allow a user to type out a query using a standard input device (e.g., an ASCII keyboard), cause the query to be translated into a relevant form at a server (e.g., translate a query written in romaji into katakana, hiragana, and/or kanji), and receive search results based on the converted form.
It should be appreciated that the present invention can be implemented in numerous ways, including as a program, an apparatus, a system, a device, a method, or a computer readable medium such as a computer readable storage medium, a carrier wave, or a computer network wherein program instructions are sent over optical or electronic communication lines. Several embodiments of the invention are described below.
In one embodiment, a method of automatically translating query terms from one language and/or character set to another language and/or character set is described. A first set of anchor text containing a given query term is identified as a set of documents (e.g., web pages) to which the anchor text points. A second set of anchor text written in a second format and pointing to the same set of documents is then identified. The second set of anchor text is then analyzed to obtain a probability that the presentation of the given query term in the first format corresponds to the presentation of the given query term in the second format.
In another embodiment, a probability dictionary is created that maps terms written in a first format (e.g., a language and/or character set) to a second format (e.g., another language and/or character set). The probabilistic dictionary is for translating a query written in a first format into a second format. The translated query is then used to perform a search, and the searched results are returned to the user. In some embodiments, user interactions with search results may be monitored and used to update probabilities in a probability dictionary. Also, in some embodiments, the query itself may be expanded to include alternative language and/or character set mappings prior to searching.
In yet another embodiment, a method of creating a probability dictionary is described. The probability dictionary is operable to translate items having a first format into a second format. The dictionary is preferably created item by identifying the anchor text or other data that contains the item. The data aligned with the anchor text or other data is then analyzed to determine a probability that a given item having the first format maps onto one or more items having the second format.
In yet another embodiment, a query provided in a first language or character set is translated into a second language or character set by comparing anchor text written in the first language or character set and corresponding to the first anchor text containing one or more of the query terms.
In another embodiment, a computer program product for translating items written in a first format into a second format is provided. The computer program product is for causing a computer system to identify aligned anchor text and determine a probability that presentation of a given item in a first format corresponds to presentation of one or more items in a second format.
In another embodiment, a method of performing a search using a potentially ambiguous query is provided. When a user enters a query having a first format, the query is translated into a set of one or more variants written in a second format. A search is then performed using the translated variants and response information is returned to the user. For example, the first format may include a sequence of numbers entered using a telephone keypad, and the second format may include alphanumeric text (e.g., English, romaji, romaja, Pinyin, etc.). In some embodiments, the set of one or more variants is selected by discarding translated variants that do not appear in the predetermined index vocabulary and/or translated variants that contain a predetermined low probability character combination. In some embodiments, the probability dictionary is used to further translate the set of one or more variants into a third format before performing the search. For example, the probability dictionary may be used to translate the set of one or more variants from romaji, romaja, or pinyin to kanji, katakana, hiragana, hangul, hangja, or traditional chinese characters, and then perform a search using the translated variants.
These and other features and advantages of the present invention will be presented in more detail in the following detailed description and the accompanying drawings which illustrate, by way of example, the principles of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the advantages and principles of the invention. In the drawings:
FIG. 1 illustrates a block diagram of a system in which methods and apparatus in accordance with the present invention are implemented;
FIG. 2 shows a block diagram of a client device according to the present invention;
FIG. 3 shows a diagram depicting three documents;
FIG. 4a illustrates a conventional alphanumeric index table;
FIG. 4b illustrates a flow diagram for providing search results responsive to a conventional alphanumeric search query;
FIG. 5a illustrates a flow diagram for providing search results responsive to an ambiguous search query in accordance with the present invention;
FIG. 5b shows a diagram for mapping alphanumeric information to numeric information;
FIG. 5c illustrates an example numeric index; and
FIG. 6 illustrates another flow diagram for providing search results responsive to an ambiguous search query in accordance with the present invention.
Fig. 7 illustrates a method for performing a search according to an embodiment of the present invention.
FIG. 8 illustrates a probability dictionary for character set translations.
FIG. 9 shows a diagram of constructing a probability dictionary using parallel anchor text.
FIG. 10 illustrates a collection of documents linked using anchor text.
11A and 11B show illustrations of computing a possible translation based on the anchor text shown in FIG. 10.
FIG. 12 illustrates probability distributions associated with the illustrated word translations.
Detailed Description
Reference will now be made in detail to embodiments of the present invention as illustrated in the accompanying drawings. The same reference numbers are used throughout the drawings and the following description refers to the same or like parts. The following description is presented to enable any person skilled in the art to make and use the working body of the invention. Descriptions of specific embodiments and applications are provided only as examples and various modifications will be apparent to those skilled in the art. For example, although examples are described in the context of internet web pages, it should be understood that embodiments of the present invention may be used to search for other types of documents and/or information, such as books, newspapers, magazines, and the like. Similarly, although many of the examples describe translations of Japanese text from romaji to katakana, hiragana, and/or kanji for illustrative purposes, one skilled in the art will appreciate that the systems and methods of the present invention may be applied to any suitable translation. For example, without limitation, embodiments of the invention may be used to search text written in, for example, traditional Chinese characters or Korean hangul or hangja characters, based on a query received in some other format (e.g., Pinyin or romaja). The general principles described herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is to be accorded the widest scope encompassing numerous alternatives, modifications, and equivalents consistent with the principles and features disclosed herein. For the purpose of brevity, details of technical material that is known in the art to which the invention relates have not been described in order to avoid unnecessarily obscuring the invention.
A. Overview
Methods and apparatus in accordance with the present invention allow a user to submit an ambiguous search query and receive potentially disambiguated search results. In one embodiment, a sequence of numbers received from a user having a standard telephone keypad is translated into a set of potentially corresponding alphanumeric sequences. These potentially corresponding alphanumeric sequences are provided as input to a conventional search engine using a boolean OR (OR) expression. As such, the search engine is used to help define search results that may be of interest to the user.
B. Framework
Fig. 1 illustrates a system 100 in which methods and apparatus in accordance with the present invention may be implemented. The system 100 may include a plurality of client devices 110 connected to a plurality of servers 120 and 130 via a network 140. Network 140 may include a Local Area Network (LAN), a Wide Area Network (WAN), a telephone network such as the Public Switched Telephone Network (PSTN), an intranet, the internet, or a combination of these networks. For simplicity, two client devices 110 and three servers 120 and 130 are shown connected to network 140. In practice, there may be more or fewer client devices and servers. Also, in some cases, the client device may perform the functions of the server, and the server may perform the functions of the client device.
Client devices 110 may include devices such as mainframes, minicomputers, personal computers, laptops, personal digital assistants, and the like, capable of connecting to network 140. Client device 110 may transmit data over network 140 or receive data from network 140 via a wired, wireless, or optical connection.
Fig. 2 illustrates an exemplary client device 110 according to the present invention. Client device 110 may include a bus 210, a processor 220, a main memory 230, a Read Only Memory (ROM)240, a storage device 250, an input device 260, an output device 270, and a communication interface 280.
Bus 210 may include one or more conventional buses that allow communication among the components of client device 110. Processor 220 may include any type of conventional processor or microprocessor for understanding and executing instructions. Main memory 230 may include a Random Access Memory (RAM) or another type of dynamic storage device for storing information and instructions to be executed by processor 220. ROM 240 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 220. Storage device 250 may include a magnetic and/or optical recording medium and its corresponding drive.
Input device 260 may include one or more conventional mechanisms that allow a user to input information to client device 110, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, and the like. Input device 270 may include one or more conventional mechanisms that output information to the user, including a display, a printer, a speaker, and the like. Communication interface 280 may include any transceiver-type mechanism that enables client device 110 to communicate with other devices and/or systems. For example, communication interface 280 may include mechanisms for communicating with another device or system via a network (e.g., network 140).
As will be described in detail below, the client device 110 according to the present invention performs certain (certain) search-related operations. Client device 110 may perform these operations in response to processor 220 executing software instructions contained in a computer-readable medium, such as memory 230. A computer-readable medium may be defined as one or more memory devices and/or carrier waves. The software instructions may be read into memory 230 from another computer-readable medium, such as data storage device 250, or from another device via communication interface 280. The software instructions contained in memory 230 enable processor 220 to perform search-related activities as described below. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the invention. Thus, the present invention is not limited to any specific combination of hardware circuitry and software.
Servers 120 and 130 may include one or more types of computer systems, such as mainframes, minicomputers, or personal computers, and the like, that may be connected to network 140 such that servers 120 and 130 communicate with client device 110. In alternative embodiments, servers 120 and 130 may include mechanisms for directly connecting to one or more client devices 110. Servers 120 and 130 may transmit data over network 140 or receive data from network 140 via wired, wireless, or optical connections.
The server may be configured in a manner similar to that described above for client device 110 with reference to fig. 2. In an embodiment consistent with the invention, server 120 may include a search engine 125 usable by client devices 110. Server 130 may store documents (or web pages) that may be accessed by client device 110.
C. Architectural operations
FIG. 3 shows a diagram depicting three documents that may be stored, for example, on one of servers 130.
The first document (document 1) includes two items- "car repair" and "car rental" -and is labeled with the number "3" at the bottom thereof. The second document (document 2) includes the entry "video rentals". The third document (document 3) includes three items, "wine," "champagne," and "bar item," and includes a link (or reference) to document 2.
For simplicity of illustration, the document shown in FIG. 3 contains only alphanumeric strings of information (e.g., "car," "repair," "wine," etc.). However, those skilled in the art will recognize that in other situations, the document may include other types of information, such as voice, or audiovisual information, etc.
FIG. 4a illustrates a conventional alphanumeric index table based on the document shown in FIG. 3. The first column of the index table includes a list of alphanumeric items and the second column includes a list of documents corresponding to the items. Some items, such as the alphanumeric item "3," correspond to (e.g., appear in) only one document, in this case, document 1. Other items, such as "real" correspond to multiple documents, in this case documents 1 and 2.
FIG. 4b illustrates how a conventional search engine, such as search engine 125, provides search results responsive to an alphanumeric search query using the index table shown in FIG. 4 a. The alphanumeric query may be generated using any conventional technique. For illustrative purposes, FIG. 4b depicts two alphanumeric queries: "car" and "wine". Under conventional methods, search engine 125 receives an alphanumeric query such as "car" (step 410), and uses the alphanumeric index table to determine which documents correspond to the query (step 420). In this example, conventional search engine 125 determines that "car" corresponds to document 1 using the index table shown in FIG. 4a, and returns document 1 (or a reference to document 1) as the search result to the user. Similarly, the conventional search engine determines that "wine" corresponds to document 3 and returns document 3 (or a reference to document 3) to the user as a search result (step 430).
FIG. 5a illustrates a flow diagram of a preferred technique for providing search results responsive to a numeric search query based on the documents and index table shown in FIGS. 3 and 4a, respectively, in accordance with the present invention. To ease illustration, FIG. 5a depicts a particular technique for processing numeric queries based on the mapping of a standard telephone handset (telephone handset); those skilled in the art will recognize that other techniques according to the present invention may also be used.
At step 510, a sequence "227" (including the digital components "2", and "7") is received from the user. In step 520, information is obtained about how the numeric components map to letters. The mapping information is shown in fig. 5b, assuming that the user enters information from a standard telephone keypad. As shown in FIG. 5b, each of the letters "a", "b", and "c" maps to the number "1", each of the letters "p", "q", "r", and "s" maps to the number "7", and so on.
At step 530, using this mapping information, the sequence "227" is translated into its potential alphanumeric equivalent. From the information shown in fig. 5b, there are 36 possible letter combinations corresponding to the sequence "227", including the following: aap, bap, cap, abp, bbp, …, bar … car … ccs. If a number is included in a possible combination (e.g., "aa 7"), there are 80 possible combinations. Rather than generate all possible alphanumeric equivalents, it may be desirable to define the equivalents generated from some index vocabulary. For example, it may be desirable to generate only those alphanumeric equivalents that appear in a dictionary, search engine log of previous search queries; or by using known statistical techniques (e.g., the probability of certain words appearing together) to otherwise define an alphanumeric equivalent.
At step 540, these alphanumeric equivalents are provided as input to a conventional search engine (e.g., as described with reference to fig. 4a and 4 b) using a logical "OR" operation. For example, the search query provided to the search engine may be "app OR bap OR abp … OR bar … OR car". Although all possible alphanumeric equivalents may be provided to the search engine, instead, subsets are used, with equivalents that are not likely to be desired eliminated using conventional techniques. For example, by using a technique that utilizes (drawupon) probabilistic information about the use of letters or words, a narrower list of possible combinations can be generated: combinations starting from "qt" may be ignored, but combinations starting from "qu" are included (and favored).
At step 550, search results are obtained from the search engine. Since items such as "aap" and "abp" do not appear in the index table of the search engine, they are effectively ignored. In fact, the entries contained in the index table shown in FIG. 4a are only "car" and "bar", so that the only search results returned are those referencing documents 1 and 3. At step 560, the search results are presented to the user. The search results may be presented in the same order as provided by the search engine or may be reordered according to considerations such as user language. Assuming that the user is only interested in documents containing the term "bar", the user receives unwanted results (document 3) in addition to the wanted results (document 1). However, this is an acceptable price to pay for, in order for the user to only press three keys to form the search query intent.
FIG. 6 illustrates another flow diagram of a preferred technique for providing search results responsive to a numeric search query based on the documents and index table shown in FIGS. 3 and 4a, respectively, in accordance with the present invention. The flow chart illustrates how increasing the size of the received sequence may help to qualify the search results to the user's intended search results. For ease of illustration, FIG. 6 again depicts a particular technique for processing numeric queries based on the mapping of a standard telephone handset; those skilled in the art will recognize that other techniques according to the present invention may also be used.
At step 610, a sequence "22748367" (including digital components "2", "7", "4", "8", "3", "6", "7") is received from the user. For purposes of this description, the sequence "227" will be referred to as a "digital word" and the entire sequence "22748367" will be referred to as a "digital phrase". The possible alphanumeric equivalents of the numeric words are referred to as "alphabetic words" and the possible alphanumeric equivalents of the numeric phrases are referred to as "alphabetic phrases".
At step 620, information is obtained about how the numeric components map to letters. Assuming the same mapping information is used as shown in FIG. 5b, at step 630, the digital phrase "22748367" is translated into a potentially corresponding alphabetical phrase. From the information shown in FIG. 5b, there are 11664 letter phrases corresponding to the sequence "22748367".
At step 640, these alphabetic phrases are provided as input to a conventional search engine (e.g., as described with reference to FIGS. 4a and 4 b) using a logical "OR" operation. For example, the search query provided to the search engine may be "' ' aap gtdmp ' OR ' aap htdmp ' … OR ' bar item ' … OR ' car item" '. Although all possible letter phrases may be provided to the search engine, instead, subsets are used by eliminating letter phrases that are not likely to be desired using conventional techniques.
At step 650, search results are obtained from the search engine. Since many search engines are designed to rank those searched documents that contain the exact phrase high (rank highly), document 3 may be the highest ranked search result (i.e., because it contains the exact phrase "bars"). No other documents in this example contain any of the other letter phrases generated in step 620. Also, many search engines reduce or eliminate the weight of search results that contain individual portions of a phrase, but not the entire phrase. For example, document 1 is made to be weighted down or eliminated because it contains the alphabetic word "car" corresponding to the first portion of the alphabetic phrase, but does not contain any alphabetic words corresponding to the second portion of the alphabetic phrase. Finally, alphabetic phrases such as "aap htdmp" are effectively ignored because they do not contain the alphabetic words that appear in the search engine index table.
At step 660, the search results are presented to the user. In the example shown, the first result displayed to the user is document 3, which is likely to be most relevant to the user query. Document 1 may be eliminated altogether because it does not contain any of the possible letter phrases. In this way, the most relevant search results are provided to the user.
Although the description above with reference to fig. 5 and 6 is made with reference to receiving and mapping digital information to alphanumeric information, those skilled in the art will recognize that other embodiments are possible in accordance with the present invention. For example, instead of receiving a sequence of numbers corresponding to a key pressed by the user, the received sequence may include the first letter corresponding to the key pressed by the user. In other words, instead of receiving "227," the received sequence may be "aap. According to the present invention, the equivalent letter sequence generated in step 530 or 630 may be other letter sequences corresponding to "aap" (e.g., "bar"). In practice, the received sequence may contain speech, audiovisual, or any other type of information member.
Regardless of the form in which the sequence is received, it is generally preferred that the received sequence be translated into a sequence corresponding to the format in which the information is stored in the index table of the search engine. For example, if the index table of the search engine is stored in alphanumeric format, the received sequence should be translated into an alphanumeric sequence.
Also, it is generally preferred that the mapping technique used to translate the received information sequence be the same technique employed at the user device for mapping user input to device generated information. However, there are also examples in which it is preferable to use a mapping technique different from that for user input.
Embodiments of the present invention may also enable a user to perform searches that are entered using a non-target language keyboard. For example, a web page containing Japanese text may be written in kanji, while a user attempting to search for the web page may only access the Roman alphabet-based ASCII keyboard (or cell phone).
Fig. 7 illustrates a method for performing such a search. As shown in fig. 7, the user types out the query using a standard input device (e.g., ASCII keyboard, telephone handset, etc.) and sends the query to the search engine. The query may be written in a different character set (e.g., romaji) than the character set (e.g., kanji) written to some of the responsive documents. The search engine receives the query (block 702), translates it into a relevant form (block 704), and performs a search on documents responsive to the translated query (block 706) using, for example, conventional search techniques. The search engine then returns a list of responsive documents (and/or copies of the documents themselves) to the user (block 708). For example, the results are returned to the user in a manner similar to that described above in connection with FIG. 6.
As shown in FIG. 7, the user query is preferably translated at a server of the search engine as opposed to the client, so that the user no longer needs to obtain specialized purpose software to perform the translation. However, it will be understood that in other embodiments, all or some of the translation may be performed at the client. Additionally, in some embodiments, the query may be entered using a device such as a telephone keypad. In such embodiments, the initial numeric query may first be converted to an alphanumeric form (e.g., romaji) using the mapping techniques described above in connection with fig. 5 and 6 (e.g., including the application of index word tables and/or probability techniques) to discard low probability mappings (e.g., mappings including combinations of letters that do not occur in romaji). Once the alphanumeric translation of the query has been obtained, the remainder of the steps shown in FIG. 7 (i.e., 704, 706, and 708) can be performed.
Translation of a query from one character set or language to another character set or language may be performed in a variety of ways (i.e., block 704 in FIG. 7). One technique is to map each term in the query to a corresponding term in the target language or character set using a conventional static dictionary with word senses or translations. However, a problem with this approach is that it will often produce inaccurate results because words are often ambiguous and queries are too short to provide sufficient contextual clues to resolve the ambiguity. For example, the word "bank" may refer to river bank, ultimate institute, or a maneuver by an airport, making it difficult to translate theoretically accurately. Additionally, if the dictionary is relatively small, and/or not updated frequently, it may not contain entries for all terms that the search engine may encounter, such as rarely used words, slang, idioms, proper names, and the like.
Embodiments of the present invention may be used to overcome or ameliorate some or all of these problems by using a probabilistic dictionary to translate query terms from one language or character set (e.g., ASCII) to another language or character set (e.g., kanji). In a preferred embodiment, the probability dictionary maps one set of terms to another and associates a probability with each mapping. For convenience, "term" or "token" refers to a word, phrase, and/or, more generally, one or more sequences of characters that may include spaces.
Fig. 8 shows an example of an equiprobable dictionary 800 such as that described above. The example probability dictionary 800 shown in FIG. 8 maps words written in romaji (Roman alphabet representation of Japanese) to words written in kanji (non-Roman ideograph-based Japanese character set). For ease of explanation, fig. 8 describes roman terms as < term > romaji and kanji terms as < term > kanji. It will be appreciated that in the actual romaji to kanji dictionary, the actual romaji and kanji terms are used instead of the english translation shown in fig. 8. Accordingly, it will be understood that fig. 8 is provided to facilitate the description of embodiments of the invention, and does not illustrate the actual features and meanings of the japanese text.
Dictionary 800 includes entries 808, 810, 812, 814 for multiple romaji entries 802. The dictionary also includes potential presentations (representations) 804 of each of these items written in kanji and corresponding probabilities 806 that each such presentation is correct. For example, the romaji term "bank" may map with a probability of 0.3 to the kanji term meaning "steep slope", with a probability of 0.4 to the term meaning "ultimate instruction", and with a probability of 0.2 to the term meaning "airport maneuver". The terms may be mapped to "other" with a probability of 0.1, which is only a general way to allow each term to map to terms that may not be in the dictionary.
Again, it will be understood that the example shown in FIG. 8 has been constructed to illustrate that a given item (e.g., the word "bank") written in a first character set or language may map to more than one item written in another character set or language. However, those skilled in the art will appreciate that the specific example in fig. 8 illustrates this principle using english words and meanings for the sake of brevity, e.g., the actual romaji presentation of the word "bank" may not be as ambiguous as its english equivalent (e.g., there is no ambiguity in romaji between the word for the fine instruction and the word for the airplan manager). It should also be understood that the dictionary shown in fig. 8 has also been simplified in other respects for ease of explanation. For example, the actual probability dictionary may contain many more potential mappings for each term, or may contain only mappings that exceed a predetermined probability threshold.
The preferred embodiment of the present invention uses such a probabilistic dictionary to translate queries written in one language and/or character set into another language and/or character set, thereby enabling users to find documents written in a different character set and/or language than their original query. For example, if a user enters a query for "cars" written in romaji, the probability dictionary may be used to map the romaji entries for "cars" to kanji entries for "cars", for example. In this way, a user can find documents relevant to their query even if the character set of the query (e.g., romaji) and the character set of the matching document (e.g., kanji) are different. Note that in this particular example, the actual language of the query is not changed (both romaji and kanji are used to represent Japanese), only the character encoding is changed.
As another example, the term "tired" written in ASCII english may be mapped to the term "mdu" using latin 1 character encoding because the character vowel u does not exist in ASCII. Note that in this example, the dictionary provides both translation into another language (english to german) and translation into another character encoding (ASCII to latin 1).
In a preferred embodiment, the mapping dictionary is constructed in an automated fashion using information available over a network and statistical techniques. The preferred embodiment uses a parallel aligned bilingual corpus (e.g., anchor text written in different languages and/or character sets) to achieve accurate translation. Using this data, the preferred embodiment can construct a potential word mapping dictionary. This may be accomplished, for example, by simply counting the number of times a token written in language Si (source language) co-occurs with token Tj (target language) in an aligned text pair (e.g., anchor, sentence, etc.). However, it will be understood that any suitable technique may be used.
In the absence of a sufficiently large and properly aligned data set, this approach may produce a more ambiguous many-to-many mapping. Thus, for example, it may only be determined that S1 maps to T2, T3, T7, and T8 at certain frequencies. However, this is acceptable, and, as described in more detail below, in some embodiments, additional improvements may be made to increase the respective likelihood of each mapping, e.g., by examining previous user queries, user selections of items on the results web page, and so forth.
FIG. 9 illustrates the use of parallel anchor text to construct a probability dictionary. Anchor text includes text related to hyperlinks between web pages (or addresses within a given web page). For example, in hypertext markup language (HTML), the command: "< A href ═ http:// www.abc.com" > Banks and Savingsand Longas "" causes the text "Banks and Savings and Longas" to be displayed as hyperlinks to web pages found at http:// www.abc.com. This text, "Banks and Savingsand roads," is referred to as "anchor text," and typically provides a short description of the web page (e.g., www.abc.com) to which it points. In fact, anchor text will often provide a more accurate description of a web page than the web page itself, and is therefore particularly useful in determining the nature of the web page to which it points. In addition, word usage and distribution in anchor text is often close in spirit and length to that found in user queries. There are also situations where many anchors pointing to a given page will contain the same or highly similar text. For example, the anchor pointing to www.google.com will often be referred to simply as "Google," or the term will be used at least with other text. Thus, by examining all anchors pointing to www.google.com, e.g., katakana, and only by looking for items that appear with the highest frequency (possibly discarding some predetermined low information-content anchors, e.g., the information-content anchor abbreviated as "click here"), katakana translations for "Google" can be inferred with a higher degree of confidence. The preferred embodiment of the present invention utilizes these features of the anchor text to provide accurate translation.
Referring to FIG. 9, upon receiving a query containing a term written in a first character set (e.g., ASCII) (block 902), the server identifies a set of anchor text in which the term appears (block 904). For example, the server may examine an index table of all known anchors to identify those anchors that contain the entry. Next, the web pages pointed to by those anchor points are identified (block 906), followed by identifying any anchors written in the target language or target character set (e.g., hiragana, katakana, and/or kanji) that point to those web pages (block 908). The system will now have two sets of documents (where anchor text is considered to be in document form). The distribution of query terms in one set of documents (e.g., anchors containing the original ASCII query) is then used to identify the most likely candidates for translated phrases in another set of documents (e.g., parallel anchors). Statistics may be calculated for the frequency of occurrence of anchor text items and used to determine the relative frequency or probability of items found in the anchor text that are correct translations of the original query (block 910). For queries with multiple terms, the above process may be repeated for each term, or the entire query may be considered a single term only, or some other suitable grouping of terms may be used. For example, if the query is "bighouses," a possible translation dictionary may be constructed by finding an aligned anchor text containing the phrase (or at least one word in the phrase). Similarly, if the query contains more than two terms, then by picking up the appropriate subset of the query terms and generating the results for those terms, an experiment can be built that determines the appropriate mapping.
One advantage of performing translation in the manner shown in FIG. 9 is that the translation system does not need to have prior knowledge of the mapping between items written in one language or character set and items written in the target set. Instead, the mapping may be dynamically determined based on the volume of data available for performing the statistical analysis. Thus, for example, accurate translations for slang terms, idioms, proper names, etc. may be found without the effort or expense of maintaining a traditional static dictionary (e.g., bilingual analysis and search).
An exemplary embodiment of the foregoing translation technique will now be described in conjunction with fig. 10-12. In this example, it will be assumed that the user has entered the query term "house" and wishes to obtain search results (or just translations of the query term) written in Spanish. The server will attempt to translate the english term "house" into its spanish equivalent.
Referring to fig. 10, a plurality of web pages 959, 961, 963, 965 are linked to web pages 972 and 974 via anchor text 960, 962, 964, 966. Some of the web pages and their associated anchor text are written in English (i.e., web pages 959a-e and 963a-t) and some are written in Spanish (i.e., web pages 961a-e and 965 a-j)). The server first locates all anchors that use the item "house". These anchors can be located, for example, by searching an index table of anchor text stored at the server. Using such an index table, the server may first find five anchors 960 that each use the phrase "big house" and point to web page 972. The server then determines that there are also five target language (i.e., spanish) anchors 962 pointing to web page 972. In the example shown in fig. 10, these anchors contain the text "casa grande". Anchors that point to the same web page (e.g., anchors 960 and 962) or anchors that carry web pages in a predetermined relationship thereto are referred to as being "aligned," where, in a more general sense, aligned generally refers to the equivalent (or possible equivalent) of the item being aligned.
FIG. 11A shows the frequency with which each target language item appears in the target language anchor 962. As shown in FIG. 11A, the items "casa" and "grande" each occur five times (i.e., once in each anchor 962). Thus, of the ten total entries that appear in the target anchor 962 (i.e., two entries per anchor in each of the five anchors), "casa" is in half and "grande" is in half. Therefore, as shown in fig. 11A, at this time, the term "house" is mapped to "casa" or "grande" with equal probability because the two terms appear with equal frequency.
However, as shown in FIG. 10, the system also finds twenty English anchors 964 containing the term "house" and pointing to web page 974 and ten Spanish anchors 966 containing the term "casa" and also pointing to web page 974. As shown in FIG. 11B, the term "house" will now map to "casa" with a probability of 0.75 (i.e., 15/20) and to "grande" with a probability of 0.25 (i.e., 5/20). These probabilities are computed by dividing the total number of occurrences of each term in the target language anchor (i.e., fifteen in the case of "casa") by the total number of terms (including duplicate terms) in the target language anchor (i.e., twenty terms: ten contained in anchor 962, ten contained in anchor 964). Alternatively, or in addition, other techniques may be used to calculate and/or improve the probability of a given translation or mapping. For example, one skilled in the art will appreciate that any of a variety of known techniques (e.g., Bayesian methods, histogram smoothing, kernel smoothing, shrinkage estimator, and/or other estimation techniques) may be used to reduce variance error (variance error) of the probability estimates.
The probability can be improved even further if more anchor text is available. For example, the final probability distribution may be similar to that shown in FIG. 12, in which "house" is mapped with high probability to "casa" and its small form (minor form) "cast" in FIG. 12, with a slightly smaller probability mapping to items like "cast" and "mangio" (spanish words of management), and with negligible probability to items like "grande". Thus, proper translation and recognition of similar synonyms can be obtained without knowledge of the language and/or character set being translated.
Having translated the query term, the server can now search using the translation. For example, if a user were to enter a romaji query for "hotsels in Kyoto," the techniques described above may be used to enable the server to infer katakana, hiragana, and kanji forms of the query, perform a search using those queries, and then present the combined results for each of those queries to the user within an appropriate user interface.
It should be understood that the examples described in connection with fig. 10-12 are provided for illustrative purposes only, and are not limiting, and that many variations may be made to the methods described herein. For example, different statistical techniques may be used to derive the probabilities, and/or modifications may be made to the basic techniques described above. Similarly, it should be appreciated that the translation techniques described above may be used only to perform translations of words or phrases entered by a user, and need not be used to perform related internet searches or to create a probability dictionary. Additionally, while the foregoing examples describe the translation process as being performed after receiving a user query, it should be understood that in other embodiments, the mapping process may be performed before receiving a user query. Such a pre-computed mapping may be stored in a dictionary such as that described in FIG. 8, which may then be applied to translate the user query upon receipt of the user query. Finally, it should be understood that text other than the aligned anchor text may be used to perform the translation. For example, aligned sentences or other data may be used in a similar manner. In many countries, there is more than one official or formal language, and newspapers and periodicals often contain the same articles written in each of these languages. These parallel translations can be used in a manner similar to the anchor text described previously to prepare a probabilistic dictionary of word translations.
Thus, the preferred embodiments advantageously enable a user to enter search queries and/or translation requests in a conventional manner (e.g., using an ASCII keyboard), and provide accurate and automated translations and searches. In some embodiments, additional improvements may be made to the basic model described above. For example, in some embodiments, priority (weight) may be given to anchors that contain multiple items similar to multiple items in the original query and/or other aligned anchors. For example, in the system shown in FIG. 10, priority may be given to the anchor that points to web page 974 because, similar to the original query, each of them contains a single term. Similarly, if the anchor containing the text "la casa grande" also points to webpage 972, its weight will be reduced by an appropriate factor because it contains more terms (i.e., 3) than other anchors it aligns to. Such a weighting scheme may be reflected in the probability calculation shown in FIG. 11B by multiplying the frequencies associated with the terms of these anchors by appropriate factors.
The translation process described above may also be used to increase the effectiveness of the search itself. For example, the probabilistic dictionary may be used to expand queries over the air (on the fly) to include, for example, various translations and synonyms of the original query term. By expanding the user query prior to document retrieval, retrieval may be performed simultaneously for the same "concept," thereby increasing the likelihood that the search results contain the terms the user is looking for. Alternatively, or in addition, the probability dictionary may be used to supplement the normal document indexing process by providing an extension of the document items. For example, translations from the probabilistic dictionary may be utilized to supplement a document index table with terms found in the document, thereby increasing the probability that the document is located even by a search that does not precisely use the same terms found in the original document.
One problem that arises when using the translation techniques described above is that the system cannot obtain sufficiently accurate probabilistic mappings due to data sparseness (e.g., not enough anchors to finalize the "casa" mapping to "house") or lack of diversity (e.g., all anchors say the same thing). Thus, in some embodiments, by examining user behavior, the probability mapping may be further improved. Several illustrative techniques are described below.
For example, assume again that the server wishes to obtain a translation for "house". However, it is assumed that only one anchor text containing the phrase "big house" or the phrase "casa grande" can be found. Due to the lack of diversity in the anchor text, the probability dictionary can get the following mappings:
house → casa, with a probability of 0.5
House → grande, with a probability of 0.5
big → Casa, with a probability of 0.5
big → grande, with a probability of 0.5
grande → house, with a probability of 0.5
grande → big, with a probability of 0.5
casa → house, with a probability of 0.5
casa → big, with a probability of 0.5
Imagine that the user now queries the search engine with the term "casa". Meanwhile, the search engine returns a web page containing the term "casa" and is also mixed in N results containing only the term "house" and M results containing only the term "big". In practice, N and M may be adjusted to account for the underlying probabilities of the mappings, so that less likely mappings will result in fewer results being displayed. If it is found that the user clicks more often on a result containing only the term "house" than on a result containing only the term "big", the mapping probability may be adjusted, for example, as follows:
house → casa, with a probability of 0.9
House → grande, with a probability of 0.1
big → Casa, with a probability of 0.1
big → grande, with a probability of 0.9
grande → house, with a probability of 0.1
grande → big, with a probability of 0.9
casa → house, with a probability of 0.9
casa → big, with a probability of 0.1
Note that the actual number depends on a number of other factors, such as the number of users that clicked into account, the number of web pages that clicked on to contain both items, the placement of the results in the result set containing the item in question, and so forth. It should also be understood that the probabilities of adjustment (i.e., 0.1 and 0.9) given in this example are for illustrative purposes only. Those skilled in the art will appreciate that the actual weighting given to the user feedback, such as described above, may be performed in any suitable manner.
It should also be noted that the foregoing example is simplified for ease of illustration of the use of feedback to the user. For example, in some systems, it will be possible to use information obtained from other translations to help perform a given translation. For example, in the example just presented, even if the term "house" only appears in the anchor text called "big house", it may be determined that "house" maps more properly to "casa" than "house" maps to "grande". For example, if it has been determined that in a sufficiently large dataset (if it is assumed that the anchor text contains little synonym in a list), then "big" maps to "grande" with very high probability, then the house-to-case mapping still takes precedence over the house-to-grande mapping, even if the anchor text containing "house" or "case" is ambiguous.
By examining the user query session history, translation accuracy and/or the usefulness of search results may also be improved. For example, in many cases, the system will know (e.g., through cookies or information stored in a user account at the server) the previous queries that the user has entered. This historical data can be used to rank the likely feelings (sense) of queries from the user, potentially eliminating "banks" for fishing-related queries from flight-related queries. Thus, the process may be used to narrow down the set of possible translations. In some embodiments, by displaying them in conjunction with, for example, "Did you mean to search for X" (where "X" refers to the budget's translation priority) in the user interface, the system can suggest these while also potentially displaying a small number of results from each possible reformulation in the first web page of results. When the user selects one of the selectable objects suggested by "Did you mean" display or results presented on the results web page, the system will obtain additional evidence (evidences) about the possible translations of the query terms and the user's possible search preferences. Both signals can then be used by the system to update the likely scores of the item map (e.g., in the probability dictionary), both in general situations and in user-specific situations.
D. Conclusion
As described in detail above, methods and systems consistent with the invention may be used to provide search results in response to ambiguous search queries and/or to translate terms into other character sets and/or languages. A variety of translation and search techniques have been described. It will be understood, however, that the foregoing description is presented for purposes of illustration and that various modifications and changes may be made in light of the above description or by practice of the invention. For example, although the above description is based on a client-server architecture, those skilled in the art will recognize that a peer-to-peer architecture may also be used in accordance with the present invention. Furthermore, although the described embodiments include software, the present invention may be implemented as a combination of hardware and software or as hardware itself. In addition, while aspects of the present invention have been described as being stored in memory, those skilled in the art will appreciate that these aspects can also be stored on other types of computer-readable media, such as secondary storage devices, like hard disks, floppy disks, or CD-ROMs; a carrier wave from the internet; or other forms of RAM or ROM. Accordingly, the scope of the invention is defined by the claims and their equivalents.

Claims (8)

1. A computer-implemented method, comprising:
receiving a first search query represented in a first format, wherein the first search query contains a first term written in the first format;
identifying, from a set of known web pages, a first set comprising a number of first anchor text strings and a second set comprising a number of second anchor text strings, wherein each of the number of first anchor text strings contains the first item, each of the number of second anchor text strings contains one or more second items written in a second format different from the first format, and each of the number of first anchor text strings and each of the number of second anchor text strings is a displayed portion of a respective first hyperlink, wherein all of the respective first hyperlinks point to a first document;
deriving one or more candidate translations of the first item written in the second format based at least in part on a frequency of occurrence of each of the one or more unique items in the number of second anchor text strings; and
providing search results in response to the search query represented in the first format, wherein the search results are identified by a search engine using a second search query containing at least one of the one or more candidate translations written in the second format.
2. The method of claim 1, wherein deriving the one or more candidate translations further comprises:
for each of the one or more unique items in the second set comprising a number of second anchor text strings,
determining respective counts of the unique items in the second set of second anchor text strings; and
calculating, based on the respective counts of the unique terms relative to a total number of occurrences of all unique terms in the second set of several second anchor text strings, respective probabilities that the unique terms are accurate translations of the first term; and
identifying the one or more candidate translations based on the respective probabilities calculated for the one or more unique terms.
3. The method of claim 1, wherein:
the method further includes identifying, from the set of known web pages, a third set comprising a number of third anchor text strings and a fourth set comprising a number of fourth anchor text strings, wherein each of the number of third anchor text strings contains the first item, each of the number of fourth anchor text strings contains one or more fourth items written in the second format, each of the number of third anchor text strings and each of the number of fourth anchor text strings is a displayed portion of a respective second hyperlink, wherein all respective second hyperlinks point to a second document different from the first document; and
deriving one or more candidate translations of the first item written in the second format based on respective frequencies of occurrence of each of the one or more unique items in all anchor text strings in a combined set comprising the second and fourth sets.
4. The method of claim 3, wherein deriving the one or more candidate translations further comprises:
for each of the one or more unique items in the combined set of anchor text strings,
determining respective counts of the unique items in the combined set of anchor text strings; and
calculating, based on the respective counts of the unique items, respective probabilities that the unique items are accurate translations of the first item written in the second format relative to a total number of occurrences of all unique items in the combined set of anchor text strings; and identifying the one or more candidate translations based on the respective probabilities calculated for the one or more unique terms in the combined set of anchor text strings.
5. A system for performing a search using a query, comprising:
a first module for receiving a first search query represented in a first format, wherein the first search query contains a first term written in the first format;
a second module for identifying, from a set of known web pages, a first set comprising a number of first anchor text strings and a second set comprising a number of second anchor text strings, wherein each of the number of first anchor text strings contains the first item, each of the number of second anchor text strings contains one or more second items written in a second format different from the first format, and each of the number of first anchor text strings and each of the number of second anchor text strings is a displayed portion of a respective first hyperlink, wherein all of the respective first hyperlinks point to a first document;
a third module for deriving one or more candidate translations of the first item written in the second format based at least in part on a frequency of occurrence of each of the one or more unique items in the number of second anchor text strings; and
a fourth module for providing search results in response to the search query represented in the first format, wherein the search results are identified by a search engine using a second search query containing at least one of the one or more candidate translations written in the second format.
6. The system of claim 5, wherein deriving the one or more candidate translations further comprises:
for each of the one or more unique items in the second set comprising a number of second anchor text strings,
determining respective counts of the unique items in the second set of second anchor text strings; and
calculating, based on the respective counts of the unique terms relative to a total number of occurrences of all unique terms in the second set of several second anchor text strings, respective probabilities that the unique terms are accurate translations of the first term; and
identifying the one or more candidate translations based on the respective probabilities calculated for the one or more unique terms.
7. The system of claim 5, wherein:
the system further comprises: a fifth module configured to identify, from the set of known web pages, a third set comprising a number of third anchor text strings and a fourth set comprising a number of fourth anchor text strings, wherein each of the number of third anchor text strings contains the first item and each of the number of fourth anchor text strings contains one or more fourth items written in the second format, each of the number of third anchor text strings and each of the number of fourth anchor text strings is a displayed portion of a respective second hyperlink, wherein all respective second hyperlinks point to a second document different from the first document; and deriving one or more candidate translations of the first item written in the second format based on the respective frequencies of occurrence of each of the one or more unique items in all anchor text strings in a combined set comprising the second and fourth sets.
8. The system of claim 7, wherein deriving the one or more candidate translations further comprises:
for each of the one or more unique items in the combined set of anchor text strings,
determining respective counts of the unique items in the combined set of anchor text strings; and
calculating, based on the respective counts of the unique items, respective probabilities that the unique items are accurate translations of the first item written in the second format relative to a total number of occurrences of all unique items in the combined set of anchor text strings; and identifying the one or more candidate translations based on the respective probabilities calculated for the one or more unique terms in the combined set of anchor text strings.
HK12104074.7A 2003-09-30 2012-04-25 A computer-implemented method and a system and device for performing searches using queries HK1163846B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/676,724 2003-09-30
US10/676,724 US8706747B2 (en) 2000-07-06 2003-09-30 Systems and methods for searching using queries written in a different character-set and/or language from the target pages

Publications (2)

Publication Number Publication Date
HK1163846A1 HK1163846A1 (en) 2012-09-14
HK1163846B true HK1163846B (en) 2014-04-25

Family

ID=

Similar Documents

Publication Publication Date Title
US9734197B2 (en) Determining corresponding terms written in different formats
US7136854B2 (en) Methods and apparatus for providing search results in response to an ambiguous search query
US9411906B2 (en) Suggesting and refining user input based on original user input
US8706474B2 (en) Translation of entity names based on source document publication date, and frequency and co-occurrence of the entity names
US20100017382A1 (en) Transliteration for query expansion
JP2009528636A (en) System and method for identifying related queries for languages with multiple writing systems
Bilac et al. Bringing the Dictionary to the User: the FOKS system
HK1163846B (en) A computer-implemented method and a system and device for performing searches using queries
HK1163858A (en) Systems and methods for searching using queries written in a different character-set and/or language from the target pages
HK1163858B (en) Systems and methods for searching using queries written in a different character-set and/or language from the target pages