HK1147326A

HK1147326A - Method and apparatus for relating datasets by using semantic vectors and keyword analyses

Info

Publication number: HK1147326A
Application number: HK11101346.6A
Authority: HK
Inventors: 文圆; 普瑞特斯‧马克里特; 弗朗斯‧荷利三世杰拉德; 劳伦斯‧法瑞斯安德鲁; 咖贝尔‧斯汀伯格
Original assignee: 特克斯特怀茨有限责任公司
Filing date: 2008-07-29
Publication date: 2011-08-05

Description

Method and apparatus for analyzing associated data sets using semantic vectors and keywords

Technical Field

The present invention relates to methods and systems for identifying contextually associated datasets, such as documents, web pages, e-mails, search queries, advertisements, and the like, and more particularly to methods and systems for identifying datasets contextually associated with a subject dataset by analyzing both a unique semantic vector representing each dataset and a keyword semantic representation that contains information about representative keywords in the dataset.

Background

Search engines and advertisement placement systems, such as those developed by Microsoft, Google, Vibrant Media, or Yahoo, are widely used to identify documents or files potentially associated with a search query entered by a user, or to select and display advertisements that are contextually related to one or more data sets, such as documents, email messages, RSS feeds, or web pages, that the user is browsing, has browsed, or is operating on.

However, even after years of development and refinement, existing search engines and advertisement placement systems are still far from satisfactory. The search results or the identified advertisements often lack sufficient relevance to the search query entered by the user or to the documents or web pages that the user is viewing or has viewed.

Disclosure of Invention

The present invention describes various embodiments that efficiently identify one or more datasets, such as documents, web pages, e-mails, etc., that may be contextually related to a subject dataset, such as a search query or a web page being browsed by a user, by analyzing a unique semantic vector representing each dataset and a keyword semantic representation containing information about representative keywords in the dataset.

According to an exemplary method of the present invention, a data processing system is controlled to associate at least one data set from a group of data sets with a subject data set. Each data set in the group and the subject data set include at least one keyword. The method obtains a semantic vector representing the subject data set and a respective semantic vector representing each individual data set in the group. The semantic vector representing each individual data set in the group comprises aggregate information on the relationship between each of the at least one keyword in the individual data set and the predetermined directories with which that keyword may be associated. Likewise, the semantic vector representing the subject data set comprises aggregate information on the relationship between each of the at least one keyword in the subject data set and the predetermined directories with which that keyword may be associated. Each of these semantic vectors has a dimension equal to the number of predetermined directories. For each data set in the group, the method determines a first similarity between the subject data set and that data set by comparing the semantic vector of the subject data set with the semantic vector of that data set.

The exemplary method further obtains a keyword semantic representation of the subject data set and a keyword semantic representation of each individual data set in the group. Each keyword semantic representation comprises information indicating representative keywords of the data set it represents and is structured differently from the semantic vector of that data set. For each individual data set in the group, the method determines a second similarity between the subject data set and that data set by comparing the keyword semantic representation of the subject data set with the keyword semantic representation of that data set. The method then selects at least one of the data sets in the group based on the first similarity and the second similarity between the subject data set and each data set in the group, and associates the at least one selected data set with the subject data set. The at least one selected data set may be presented to the user simultaneously with the subject data set or after the subject data set is presented to the user. The at least one selected data set or the subject data set may be presented to the user in a voice form, a visual form, a video form, a tactile form, or any combination thereof.

In one embodiment, at least one of the data sets in the group is an advertisement and the subject data set is a document, a web page, an email, an RSS news feed, a data stream, broadcast data, or user-related information; or a portion of one or more documents, web pages, emails, RSS news feeds, data streams, broadcast data, or information related to the user, or a combination thereof. According to yet another embodiment, the exemplary method communicates the at least one selected data set or the file associated with the selected data set and the subject data set or the file associated with the subject data set to a user. The at least one selected data set may be communicated to the user by displaying the at least one selected data set, playing a voice signal according to the at least one selected data set, or providing a link to the at least one selected data set.

In one embodiment, the at least one keyword includes at least one of a word, a phrase, a string, a pre-assigned keyword, a sub data set, meta information, and information retrieved based on a link contained in the individual data set. In another embodiment, the semantic vector for each data set is pre-computed and contained in the individual data set. Alternatively, the semantic vector may be generated dynamically on the fly.

According to one embodiment, the semantic vector representing each individual data set in the group is constructed based on the at least one keyword of that data set and on known relationships between known keywords and the predetermined directories with which the known keywords are likely to be associated. The semantic vector representing the subject data set is constructed in the same way from the at least one keyword of the subject data set and the same known relationships. According to another embodiment, the semantic vector associated with an individual data set is generated further based on information related to at least one user or on at least one data set linked to the individual data set. The information related to the at least one user includes at least one of previously viewed documents, previous search requests, user preferences, and personal information.
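By way of non-limiting illustration, the following sketch shows one way in which keyword-to-directory relationships could be aggregated into a vector whose dimension equals the number of predetermined directories. The directory names, the weights in `KNOWN_RELATIONS`, and the plain summation rule are all assumptions made for this example; this summary does not specify how the aggregate information is actually computed.

```python
from collections import Counter

# Hypothetical predetermined directories; the vector has one dimension per directory.
DIRECTORIES = ["sports", "finance", "travel"]

# Hypothetical known relationships: keyword -> weight for each directory.
KNOWN_RELATIONS = {
    "goal":   [0.9, 0.1, 0.0],
    "bank":   [0.0, 0.8, 0.2],
    "flight": [0.0, 0.1, 0.9],
}

def semantic_vector(keywords):
    """Sum the directory weights of every known keyword occurrence."""
    vector = [0.0] * len(DIRECTORIES)
    for keyword, count in Counter(keywords).items():
        weights = KNOWN_RELATIONS.get(keyword)
        if weights is None:
            continue  # unknown keywords contribute nothing in this sketch
        for i, weight in enumerate(weights):
            vector[i] += count * weight
    return vector

page_vector = semantic_vector(["bank", "flight", "flight", "hotel"])
```

In this sketch the keyword "hotel" is ignored because it has no known relationship, and the resulting vector always has exactly one component per directory regardless of how many keywords the data set contains.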

According to one embodiment, the step of selecting at least one of the data sets in the group based on the first similarity and the second similarity between the subject data set and each data set in the group comprises: designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity; obtaining information on a plurality of preset relevance levels for the primary similarity; for each data set in the group, mapping the data set to one of the preset relevance levels according to its primary similarity; sorting the data sets in the group according to the preset relevance levels to which they are respectively mapped; within each relevance level, ranking the data sets in that level according to their secondary similarities; and selecting at least one of the data sets in the group according to the resulting ranking of the data sets within each relevance level.

According to another embodiment, the step of selecting at least one of the data sets in the group based on the first similarity and the second similarity between the subject data set and each data set in the group comprises: designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity; sorting the data sets in the group according to the primary similarity; selecting at least one candidate data set from the sorted data sets according to a preset criterion; ranking the at least one candidate data set according to the secondary similarity; and selecting the at least one of the data sets in the group according to the resulting ranking of the at least one candidate data set.
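By way of non-limiting illustration, the two-stage selection just described may be sketched as follows. The function name `two_stage_select`, the sample scores, and the use of a fixed "top N by primary similarity" rule as the preset criterion are assumptions made for this example only.

```python
def two_stage_select(scores, num_candidates, num_selected):
    """scores: dict mapping a data set id to (primary, secondary) similarity."""
    # Stage 1: order by primary similarity and keep the top candidates.
    by_primary = sorted(scores, key=lambda d: scores[d][0], reverse=True)
    candidates = by_primary[:num_candidates]
    # Stage 2: re-rank only the candidates by secondary similarity.
    by_secondary = sorted(candidates, key=lambda d: scores[d][1], reverse=True)
    return by_secondary[:num_selected]

scores = {
    "ad_a": (0.91, 0.20),
    "ad_b": (0.85, 0.75),
    "ad_c": (0.40, 0.95),
    "ad_d": (0.80, 0.50),
}
selected = two_stage_select(scores, num_candidates=3, num_selected=2)
```

Here "ad_c" has the highest secondary similarity but is never considered, because it does not survive the primary-similarity cut. The primary similarity therefore acts as a gate and the secondary similarity as a tie-breaker among the survivors.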

According to yet another embodiment, the step of selecting at least one of the data sets in the group based on the first similarity and the second similarity between the subject data set and each data set in the group comprises: for each data set in the group, calculating a composite similarity from the first similarity and the second similarity of that data set according to a preset formula; and selecting the at least one of the data sets in the group according to the respective composite similarities of the data sets.

An exemplary data processing system is operative to associate at least one data set of a group of data sets with a subject data set. Each data set in the group and the subject data set include at least one keyword. The system includes a data processor configured to process data and a data storage system configured to store instructions which, upon execution by the data processor, control the data processor to perform specified steps. These steps include: obtaining a semantic vector representing the subject data set and a respective semantic vector representing each individual data set in the group, wherein the semantic vector representing each individual data set in the group comprises aggregate information on the relationship between each of the at least one keyword in the individual data set and the predetermined directories with which that keyword may be associated, the semantic vector representing the subject data set comprises aggregate information on the relationship between each of the at least one keyword of the subject data set and the predetermined directories with which that keyword may be associated, and each of these semantic vectors has a dimension equal to the number of predetermined directories; for each data set in the group, determining a first similarity between the subject data set and that data set by comparing the semantic vector associated with the subject data set with the semantic vector associated with that data set; obtaining a keyword semantic representation of the subject data set and a keyword semantic representation of each individual data set in the group, wherein each keyword semantic representation comprises information indicating representative keywords of the data set it represents and is structured differently from the semantic vector of that data set; for each data set in the group, determining a second similarity between the subject data set and that data set by comparing the keyword semantic representation of the subject data set with the keyword semantic representation of that data set; selecting at least one of the data sets in the group based on the first similarity and the second similarity between the subject data set and each data set in the group; and associating the at least one selected data set with the subject data set.

The exemplary systems described herein may be implemented using one or more computer systems and/or appropriate software.

One embodiment of the invention is a machine-readable medium carrying instructions which, when executed by a data processing system, control the data processing system to perform machine-implemented steps to associate at least one data set of a group of data sets with a subject data set. Each data set in the group and the subject data set include at least one keyword. These steps include: storing a semantic vector representing the subject data set and a respective semantic vector representing each individual data set in the group, wherein the semantic vector representing each individual data set in the group comprises aggregate information on the relationship between each of the at least one keyword of the individual data set and the predetermined directories with which that keyword may be associated, the semantic vector representing the subject data set comprises aggregate information on the relationship between each of the at least one keyword of the subject data set and the predetermined directories with which that keyword may be associated, and each of these semantic vectors has a dimension equal to the number of predetermined directories; for each data set in the group, determining a first similarity between the subject data set and that data set by comparing the semantic vector associated with the subject data set with the semantic vector associated with that data set; obtaining a keyword semantic representation of the subject data set and a keyword semantic representation of each individual data set in the group, wherein each keyword semantic representation comprises information indicating representative keywords of the data set it represents and is structured differently from the semantic vector of that data set; for each data set in the group, determining a second similarity between the subject data set and that data set by comparing the keyword semantic representation of the subject data set with the keyword semantic representation of that data set; selecting at least one of the data sets in the group based on the first similarity and the second similarity between the subject data set and each data set in the group; and associating the at least one selected data set with the subject data set.

Additional advantages and novel features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The embodiments shown and described provide illustration of the best mode contemplated for carrying out the invention. Each feature or embodiment described herein may be implemented alone or in combination with other features or embodiments. The present invention may be modified in various obvious respects, all without departing from the spirit and scope of the present invention. The drawings and description are to be regarded as illustrative in nature, and not as restrictive. The advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.

Drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which elements having the same reference numeral designations represent like elements throughout and in which:

FIG. 1 is a block diagram of an exemplary advertisement placement system;

FIG. 2 illustrates an embodiment of an exemplary advertisement placement system in accordance with the present invention;

FIG. 3 illustrates operation of another embodiment of an exemplary advertisement placement system according to the present invention;

FIG. 4 is an exemplary table showing the relationship between words and directories;

FIG. 5 is an exemplary table representing values corresponding to the meaning of the words of FIG. 4;

FIG. 6 sets forth an exemplary table of representations of the words of FIG. 4 in semantic space; and

FIG. 7 is a block diagram of an exemplary computer system on which an exemplary advertisement placement system may be implemented.

Detailed Description

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the inventive concept may be practiced or carried out without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the present invention.

As used in the description herein, the term "dataset" refers to a collection of human and/or machine readable and/or understandable representations, while the term "keyword" refers to one or more elements, such as textual or symbolic elements, numbers, etc., in the dataset. For example, if the data set is a document, the keywords may be one or more words, phrases, punctuation marks, symbols, and/or sentences contained in the document. The data set may be a collection of various different types of data sets, or a portion of a larger data set. A data set may be a summary and/or tags that summarize or describe the contents of another data set. The keywords may or may not be directly visible to the user. For example, a keyword may be a closed caption of a video file or a portion of a closed caption, the lyrics of an audio file, or an element of the metadata associated with a Word document. Additional processing may be required before a person or machine can determine or process the keywords. For example, to facilitate processing and/or recognition by a human or machine, optical character recognition or voice recognition may be applied to convert an element from a first format to a second format.

Examples of data sets include web pages, videos, audio or multimedia files, advertisements, emails, documents, RSS feeds, photos, images, portraits, electronic computer files, audio tapes, broadcasts, video files, metadata, etc., or a collection of one or more of the foregoing.

Examples of keywords include words, phrases, symbols, terms, hyperlinks, metadata information, and/or displayed or undisplayed terms included in or associated with a data set. In the context of the present invention, "web page" is understood to mean a combination or collection of information that can be displayed in a web browser, such as Microsoft Internet Explorer, the contents of which may include, but are not limited to: Hypertext Markup Language (HTML) pages, JavaScript, XML pages, email messages, and RSS news feeds.

As applied to the present invention, the term "subject data set" refers to one or more data sets for which an exemplary system is intended to identify, from among a group of data sets, those data sets that are contextually related to the subject data set. For example, a subject data set may be a search query that a user enters in an attempt to find documents relevant to the query, or one or more web pages for which an exemplary system according to the present invention intends to find suitable advertisements to display.

For illustrative purposes, the following examples describe operations in embodiments that identify one or more datasets, such as advertisements, that are contextually related to a subject dataset, such as a web page being viewed by a user, based on analysis of unique semantic vectors, such as Training Semantic Vectors (TSVs), representing the web page and the advertisements, and of semantic representations containing representative keyword information for the web page and the advertisements. Various formulas and statistical operations may be applied to identify important or representative keywords so that they can be weighted more heavily than other keywords.

It is to be understood that similar approaches and methods may be applied to different types of data sets and/or subject data sets. For example, documents or web pages that are contextually related to one or more search queries (the subject data set) input by a user may be identified in a similar manner, as may a web page that is potentially associated with one or more advertisements.

The Training Semantic Vector (TSV) is a unique type of semantic representation of a data set and is generated based on the data points contained in the data set and the known relationships between known data points and the predetermined directories. For a detailed description of the structure and characteristics of training semantic vectors, see U.S. patent No. 6,751,621, entitled "structure and classification of training semantic vectors," filed on 2.5.2000, and U.S. patent application serial No. 11/126184 (attorney docket No. 55653-.

FIG. 1 shows an exemplary advertisement placement system 10 that identifies, from a set of advertisements 12, one or more advertisements that are contextually related to a web page 11 being viewed by a user. To do so, the system 10 analyzes at least two types of semantic representations of the advertisements 12 and the web page 11: TSVs, and semantic representations that include representative keyword information for the advertisements 12 and the web page 11. An advertisement 12 may be composed of any combination of media, such as text, sound, animation, or the like. Based on the results of these analyses, the system 10 generates matching results that identify the selected advertisements that are contextually related to the web page 11.

The selection of one or more advertisements for a particular data set or web page may occur at the time the data set is presented to the user, before the data set is presented to the user, or after the data set is presented to the user. In another embodiment, the advertisement placement system 10 is used to select one or more advertisements 12 contextually associated with a web page 11 such that the web page is displayed with, or linked to, the one or more selected advertisements. The data sets identified as being associated with the subject data set may be transmitted or presented to the user together with the subject data set, or at a time different from when the subject data set is transmitted or presented. These data sets may be transmitted or presented to a user in various forms or formats, such as voice form, video form, visual form, tactile form, machine readable format, or any combination thereof.

The TSVs associated with each advertisement 12 or web page 11 may be pre-calculated or calculated on the fly. In one embodiment, each web page or advertisement includes embedded or associated information about its pre-computed TSVs. In another embodiment, the TSVs associated with the web page 11 are dynamically calculated by the system 10.

FIG. 2 is a detailed block diagram of one embodiment of the advertisement placement system 10. As shown in FIG. 2, the advertisement placement system 10 includes term extractors 102, 112 that are used to identify and retrieve keywords from an advertisement 12 or a web page 11. The term extractors 102, 112 perform linguistic analysis on the content of the advertisement 12 or web page 11 to segment its sentences into smaller units, such as words, phrases, etc. Frequently used terms, such as function words like "the", "a", etc., may be removed using a preset stop list. If the advertisement 12 or web page 11 includes information that is not actual content (e.g., HTML markup tags or JavaScript), that information may be removed. Software for performing term extraction is widely available and known to those skilled in the art.
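By way of non-limiting illustration, a minimal term extractor of the kind described above may be sketched as follows. The stop list contents, the regular expressions, and the function name `extract_terms` are assumptions for this example; a production extractor would typically use a proper HTML parser and more thorough linguistic analysis.

```python
import re

# Hypothetical preset stop list of frequently used function words.
STOP_LIST = {"the", "a", "an", "of", "and", "to", "in"}

def extract_terms(html):
    """Strip markup, split into lowercase words, and drop stop-list terms."""
    # Remove script blocks first, then any remaining markup tags.
    text = re.sub(r"<script\b.*?</script>", " ", html,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_LIST]

terms = extract_terms(
    "<p>The price of a <b>flight</b> to Rome</p><script>var x = 1;</script>"
)
```

The script body is discarded before tag stripping so that its contents are not mistaken for page text.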

The advertisement placement system 10 further includes TSV generators 103, 113 that are used to calculate TSVs for the advertisement 12 or web page 11 based on the output of the term extractors 102, 112. The system 10 may use a common TSV generator for both the advertisements 12 and the web page 11. Alternatively, the advertisements 12 and the web page 11 may be processed separately using separate TSV generators.

The advertisement placement system 10 includes a TSV indexer 114 and a TSV index database 118 for organizing and storing the generated TSVs for efficient searching. The TSV indexer 114 may be implemented using a full database management system (DBMS) or simply a software package for large-scale data record management, while the TSV index database 118 may be implemented as database storage holding TSV index files that include the TSVs of the advertisements 12 and links to those advertisements. Different indexing schemes may be applied to speed up the search. For example, one conventional approach to indexing TSVs is to list them under each semantic directory to which they relate.
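By way of non-limiting illustration, listing TSVs under each semantic directory may be sketched as an inverted index. The rule used here, namely that a TSV is listed under a directory when its weight for that directory exceeds `min_weight`, is an assumption for this example, as are the function names and sample vectors.

```python
from collections import defaultdict

def build_directory_index(tsvs, min_weight=0.0):
    """Map each directory position to the ids whose TSV weight exceeds min_weight there."""
    index = defaultdict(list)
    for item_id, vector in tsvs.items():
        for directory, weight in enumerate(vector):
            if weight > min_weight:
                index[directory].append(item_id)
    return index

def candidates_for(query_vector, index, min_weight=0.0):
    """Collect ids that share at least one active directory with the query vector."""
    found = set()
    for directory, weight in enumerate(query_vector):
        if weight > min_weight:
            found.update(index.get(directory, []))
    return found

ad_tsvs = {
    "ad_a": [0.9, 0.0, 0.1],
    "ad_b": [0.0, 0.7, 0.0],
    "ad_c": [0.0, 0.0, 0.8],
}
index = build_directory_index(ad_tsvs)
hits = candidates_for([0.0, 0.0, 1.0], index)
```

Only the advertisements returned by `candidates_for` would then need a full vector comparison, which is how such an index can speed up the search.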

The TSVs associated with each advertisement 12 and the TSVs associated with the web page 11 are input to the TSV matcher 104 to determine the respective TSV similarities between the web page 11 and each advertisement. These similarities may be expressed as relevance scores. In one embodiment, the similarity or association between TSVs is determined based on the distance between the semantic vectors, e.g., by determining an N-dimensional Euclidean distance between the TSVs, where N is the dimension of the semantic space, that is, the number of predetermined directories. The shorter the distance between the TSV of the web page 11 and the TSV of an advertisement, the more similar the web page 11 and the advertisement are. Other comparison methods may also be applied, such as cosine measures, Hamming distances, Minkowski distances, or Mahalanobis distances. Various optimizations may be made to reduce comparison time, including reducing the dimensions of the TSVs prior to comparison and eliminating certain advertisements using filters before or after comparison.
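By way of non-limiting illustration, the Euclidean distance and cosine measure mentioned above may be computed as follows. The function names and the sample three-dimensional vectors are assumptions for this example; real TSVs would have one component per predetermined directory.

```python
import math

def euclidean_distance(u, v):
    """N-dimensional Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_distance(page_tsv, ad_tsvs):
    """Return advertisement ids ordered from the closest TSV to the farthest."""
    return sorted(ad_tsvs, key=lambda ad: euclidean_distance(page_tsv, ad_tsvs[ad]))

page_tsv = [0.1, 0.2, 0.9]
ad_tsvs = {"ad_a": [0.9, 0.0, 0.1], "ad_b": [0.0, 0.3, 0.8]}
ranking = rank_by_distance(page_tsv, ad_tsvs)
```

Note that the two measures point in opposite directions: a distance must be sorted in ascending order, whereas a cosine similarity must be sorted in descending order, before either is turned into a relevance score.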

Based on the TSV comparison results, the TSV matcher 104 generates a TSV match list 105 that includes a ranked list of matching advertisements selected from the advertisements 12 according to their respective TSV similarities with the webpage 11. A preset threshold may be applied to select only those advertisements that have a degree of similarity above the preset threshold.

The advertisement placement system 10 further includes means for determining and comparing a type of textual representation, other than TSVs, of the web page 11 and the advertisements 12. In one embodiment, the advertisement placement system 10 generates semantic representations that include information about representative keywords of the web page 11 and the advertisements 12.

As shown in FIG. 2, the keyword selectors 115, 106 receive the terms retrieved by the term extractors 102, 112 and select a subset of keywords from the content of the web page 11 or each advertisement 12 to represent it, according to one or more metrics, such as term frequency (how often a term occurs in the page), inverse document frequency (based on the fraction of pages in the collection that contain the term), or other methods known to those skilled in the art. For example, the keyword selectors 115, 106 may calculate the frequency or number of occurrences of each term in the web page 11 or each advertisement, and select representative keywords based on the calculated frequency or number of occurrences of each term.
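By way of non-limiting illustration, representative keywords may be selected by weighting each term's frequency in the page by a document-frequency factor, a combination commonly known as TF-IDF. The smoothed weighting formula, the sample corpus, and the function name `select_keywords` are assumptions for this example rather than the specific metric used by the keyword selectors.

```python
import math
from collections import Counter

def select_keywords(doc_terms, corpus, top_k=3):
    """Pick the top_k terms of doc_terms by term frequency times a smoothed IDF."""
    term_counts = Counter(doc_terms)
    n_docs = len(corpus)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for other in corpus if term in other)
        # Smoothed IDF, so terms absent from the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

corpus = [
    ["cheap", "flight", "deal"],
    ["hotel", "deal"],
    ["bank", "loan", "deal"],
]
keywords = select_keywords(["flight", "flight", "deal", "rome"], corpus, top_k=2)
```

In this example "deal" occurs in every document of the corpus and is therefore outranked by "rome", even though both occur once in the page.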

Another example is to use a stop list to delete keywords that provide little topic information about the web page 11 or advertisement 12. The term extractors 102, 112 maintain or have access to a stop list that includes the most commonly occurring words, which provide little information about the topic. The keywords included in the stop list are not good search terms. The stop list may be created by a linguistic expert, by automatic (e.g., statistical) analysis, by the user, or by a combination of the three. It will be appreciated that other methods known to those skilled in the art may be applied to select keywords from the web page 11 or advertisement 12 to represent the web page 11 or advertisement 12.

After the keyword selector 115 identifies a representative keyword for each advertisement, a keyword index database 117 is provided to store the representative keywords and links to the various advertisements 12.

As shown in FIG. 2, a keyword matcher 107 is provided for determining the keyword similarity between the web page 11 and each of the advertisements 12 based on the representative keywords selected for each individual advertisement and for the web page 11. In one embodiment, the keyword matcher 107 queries the keyword index database 117 with the group of keywords selected for the web page 11 and generates a keyword relevance score for each advertisement with respect to the web page 11 according to one or more known algorithms. For example, a relevance score between two sets of representative keywords is calculated based on the number of matching or common keywords contained in the advertisement and the web page (one term, one vote). In another embodiment, the keyword matcher 107 employs a more elaborate voting scheme (e.g., electoral groups, weighted votes, absolute veto privileges, or strength of support) to determine the degree of similarity between each advertisement and the web page 11. Other types of calculations, such as vector space models, may calculate the relevance score using direct or modified cosine similarity.
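By way of non-limiting illustration, the simplest scoring described above, in which the score is the number of keywords common to the advertisement and the web page, may be sketched as follows. The function names and sample keyword sets are assumptions for this example.

```python
def overlap_score(page_keywords, ad_keywords):
    """Count keywords common to both sets; each shared keyword counts once."""
    return len(set(page_keywords) & set(ad_keywords))

def rank_ads_by_keywords(page_keywords, ads):
    """ads: dict mapping an advertisement id to its representative keywords."""
    scores = {ad: overlap_score(page_keywords, kws) for ad, kws in ads.items()}
    # Highest score first; ties are broken by id for determinism.
    ranked = sorted(scores, key=lambda ad: (-scores[ad], ad))
    return [(ad, scores[ad]) for ad in ranked]

ads = {
    "ad_a": ["flight", "rome", "hotel"],
    "ad_b": ["loan", "bank"],
    "ad_c": ["rome", "pizza"],
}
keyword_match_list = rank_ads_by_keywords(["flight", "rome", "museum"], ads)
```

A weighted variant could replace the set intersection with a sum of per-keyword weights, which would be one possible way of realizing a more elaborate voting scheme.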

After the keyword matcher 107 calculates the respective similarities between the web page 11 and each individual advertisement, the keyword matcher 107 generates a keyword matching list 108 that orders the advertisements 12 based on their respective similarities to the web page 11 or their respective association scores.

The TSV match list 105 and the keyword match list 108 are sent to a combiner 109, which generates a final matching list 110 from the information contained in the keyword match list 108 and the TSV match list 105. In one embodiment, for each advertisement in the TSV match list 105 or the keyword match list 108, the combiner 109 computes a composite relevance score based on its relevance scores in the TSV match list 105 and the keyword match list 108. The final matching list 110 is then generated based on the respective composite relevance scores of the advertisements.

In one embodiment, the composite relevance score is calculated as follows.

If an advertisement is included in both the TSV match list 105 and the keyword match list 108, then

$$\text{composite score} = a_1 \cdot \text{TSV score} + b_1 \cdot \text{keyword score} + c_1 \qquad (1)$$

If an advertisement is included only in the TSV match list 105, then

$$\text{composite score} = a_2 \cdot \text{TSV score} + c_2 \qquad (2)$$

If an advertisement is included only in the keyword match list 108, then

$$\text{composite score} = b_3 \cdot \text{keyword score} + c_3 \qquad (3)$$

To some extent, the coefficients a₁, a₂, b₁, b₃, c₁, c₂, and c₃ can be chosen in such a way that equations (2) and (3) are special cases of equation (1). The relevance scores in each or all of the matching lists may be normalized to the range [0, 1]. Conditional or unconditional thresholds may be applied to the relevance scores of each or all of the matching lists to narrow down the lists. The final matching list 110 is derived based on the composite scores of the advertisements.
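As a minimal sketch of equations (1) to (3), the following assumes one shared coefficient set and treats a missing score as 0, which is one way of making equations (2) and (3) special cases of equation (1). The coefficient values and the sample scores are hypothetical.

```python
def composite_score(tsv_score=None, keyword_score=None, a=0.5, b=0.5, c=0.0):
    """Composite relevance score of one advertisement.

    A score of None means the advertisement is absent from that list.
    Assumption: a missing score contributes 0, so (2) and (3) reduce to (1).
    """
    if tsv_score is None and keyword_score is None:
        raise ValueError("advertisement is in neither matching list")
    tsv = 0.0 if tsv_score is None else tsv_score
    kw = 0.0 if keyword_score is None else keyword_score
    return a * tsv + b * kw + c

# Hypothetical normalized scores in the two matching lists.
tsv_list = {"ad1": 0.8, "ad2": 0.6}
keyword_list = {"ad1": 0.4, "ad3": 0.9}

scores = {ad: composite_score(tsv_list.get(ad), keyword_list.get(ad))
          for ad in tsv_list.keys() | keyword_list.keys()}
final_matching_list = sorted(scores, key=scores.get, reverse=True)
```

With independent coefficients per case, the function would simply select a different (a, b, c) triple depending on which lists contain the advertisement.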

In another embodiment, the advertisements in the TSV match list 105 and the keyword match list 108 are reordered using a proprietary formula to form the final match list 110. Each advertisement in the TSV match list 105 and the keyword match list 108 is associated with a respective TSV relevance score and keyword relevance score. The TSV match list 105 ranks the advertisements according to their respective TSV relevance scores, while the keyword match list 108 ranks the advertisements according to their respective keyword relevance scores. One of the TSV relevance score and the keyword relevance score is designated the primary relevance score, and the other is designated the secondary relevance score.

Table 1 shows an exemplary ordered list with TSV relevance scores as primary relevance scores and keyword relevance scores as secondary relevance scores. For purposes of illustration, the relevance scores are normalized to the range [0, 1].

TABLE 1

The primary relevance score of each advertisement is mapped to a preset association level corresponding to a particular range of relevance scores. The advertisements are then ordered according to their mapped association levels. Within each association level, the advertisements are ordered using the secondary relevance score of each individual advertisement.

For example, in the example shown in Table 1, the TSV relevance scores are mapped to three different association levels:

If relevance score < 0.4, then association level = 1

If 0.4 <= relevance score < 0.7, then association level = 2

If relevance score >= 0.7, then association level = 3

After conversion, the advertisements are reordered according to their respective association levels. The advertisements within each individual association level are then reordered according to their respective secondary relevance scores. Table 2 shows the results of the reordering. Column 1 of Table 2 gives the final ordering of the advertisements.

TABLE 2

ID	Association level (according to TSV-score)	TSV-score	Keyword-score
K2	3	0.78	0.88
T2	3	0.85	0.85
T1	3	0.89	0.75
K1	2	0.5	0.95
T3	2	0.45	0.6
K3	1	0.3	0.73
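The level mapping and two-stage ordering can be sketched as follows, using the scores listed in Table 2 with the TSV score as the primary score. The function and variable names are illustrative only.

```python
def association_level(score):
    """Map a primary relevance score in [0, 1] to one of three levels."""
    if score < 0.4:
        return 1
    if score < 0.7:
        return 2
    return 3

# (ID, TSV score, keyword score), values as listed in Table 2.
ads = [
    ("T1", 0.89, 0.75), ("T2", 0.85, 0.85), ("K2", 0.78, 0.88),
    ("K1", 0.5, 0.95), ("T3", 0.45, 0.6), ("K3", 0.3, 0.73),
]

# Order by association level first, then by secondary (keyword) score,
# both in descending order.
ranked = sorted(ads, key=lambda ad: (association_level(ad[1]), ad[2]),
                reverse=True)
final_order = [ad[0] for ad in ranked]
```

Under these assumptions the resulting order is K2, T2, T1, K1, T3, K3, which matches column 1 of Table 2.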

The ad placement system 10 then selects one or more ads from the final match list 110 for association to the web page 11 based on the ranking of the final match list 110. According to one embodiment, the selected advertisement is displayed with the web page 11 or linked to the web page 11.

It will be appreciated that in other embodiments, the keyword relevance score may be designated the primary relevance score and the TSV relevance score the secondary relevance score. It will also be appreciated that the relevance scores may be mapped to a different number of levels, or to levels with different score ranges, as required by the design. It will also be appreciated that a conditional or unconditional threshold may be applied to the relevance scores of each or all of the matching lists to narrow the lists.

In another embodiment, the system 10 may generate the final match list 110 primarily in accordance with one of the TSV match list 105 and the keyword match list 108. For example, the system 10 selects a preset number of advertisements from the keyword matching list 108 based on their respective keyword relevance scores, and then calculates the TSV relevance score of each selected advertisement. The selected advertisements are then reordered based on their respective TSV relevance scores. The system 10 outputs the reordered list as the final match list 110.
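A minimal sketch of this select-then-reorder variant is given below. The function name, the callable used to obtain TSV scores, and the sample values are hypothetical.

```python
def select_then_rerank(keyword_scores, tsv_score_of, preset_count):
    """Keep the top `preset_count` ads by keyword score, then order them by TSV score."""
    shortlist = sorted(keyword_scores, key=keyword_scores.get,
                       reverse=True)[:preset_count]
    # TSV scores are only needed for the shortlisted advertisements.
    return sorted(shortlist, key=tsv_score_of, reverse=True)

# Hypothetical scores.
keyword_scores = {"ad1": 0.9, "ad2": 0.8, "ad3": 0.7, "ad4": 0.2}
tsv_scores = {"ad1": 0.3, "ad2": 0.6, "ad3": 0.5, "ad4": 0.99}

final_match_list = select_then_rerank(keyword_scores, tsv_scores.get, 3)
```

Note that "ad4" never reaches the final list despite its high TSV score, because it is filtered out in the keyword stage.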

FIG. 3 illustrates another exemplary advertisement placement system 20 for associating one or more advertisements 12 to a web page 11 based on context-based association of the advertisements. To simplify the discussion, elements having the same reference number designation represent like elements previously discussed.

In system 20, the TSV and keyword semantic representations of the advertisements 12 are stored within a database 212. For each advertisement, the database 212 provides two data fields, one for the TSV and one for the keyword semantic representation. The advertisement placement system 20 further includes a TSV and keyword indexer 211 for organizing and managing the TSV and keyword semantic representations. The TSV and keyword indexer 211 may be implemented with a full database management system (DBMS) or merely with a software package for large-scale data record management, and may be implemented together with the database 212. Different retrieval schemes may be applied to speed up the search.

The system 20 includes the term decimators 102 and 112, TSV generators 103 and 113, and keyword selectors 106 and 115, all having the same functionality as previously described with reference to FIG. 2. For each advertisement, the TSV and keyword consolidator 210 associates the TSV and keyword semantic representations with that advertisement. Similarly, for the web page 11, the TSV generator 103 generates a TSV and the keyword selector 106 generates a keyword semantic representation. The TSV and keyword associator 205 associates or links these TSV and keyword semantic representations to the web page 11. The TSV and keyword semantic representations of the web page 11 and the advertisements 12 are processed by a TSV and keyword matcher 206, which performs functions similar to those of the TSV matcher 104 and keyword matcher 107 previously discussed with respect to FIG. 2. Relevance scores for the TSV and keyword semantic representations may be calculated by methods similar to those described with respect to FIG. 2. As previously discussed with reference to FIG. 2, the TSV and keyword matcher 206 generates a final match list 213.

In another embodiment, the joint relevance score for each advertisement, or each candidate or target data set, may be calculated by combining the keyword semantic representation and the semantic vector representation of the data set in the same vector space. For example, the keyword representation and the semantic vector representation of an advertisement are treated as vectors in the same vector space and are combined to form a single joint semantic vector representation of the advertisement.

In computing the joint semantic vector representation, different weights may be configured for the semantic vector representation and the keyword semantic representation. For each advertisement, an association score is calculated based on the joint semantic vector representation of the advertisement and the joint semantic vector representation of the target dataset. The TSV and keyword matcher 206 generates a final matching list 213 from the respective joint relevance scores of the advertisements.
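The joint representation can be sketched as follows. Two points are assumptions made for illustration, since the text does not fix them: the two representations are combined as a weighted sum, and the association score is computed as cosine similarity. The weights and the sample vectors are hypothetical.

```python
import math

def joint_vector(tsv, keyword_vec, w_tsv=0.5, w_kw=0.5):
    """Weighted sum of two vectors that live in the same vector space."""
    return [w_tsv * t + w_kw * k for t, k in zip(tsv, keyword_vec)]

def cosine(x, y):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Hypothetical three-dimensional representations.
page = joint_vector([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
ad_same_topic = joint_vector([1.0, 0.0, 0.0], [0.5, 0.0, 0.0])
ad_other_topic = joint_vector([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])

score_same = cosine(page, ad_same_topic)
score_other = cosine(page, ad_other_topic)
```

Raising `w_kw` relative to `w_tsv` would make exact keyword agreement count for more than broad topical agreement.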

It will be appreciated that a matching list generated based on the keyword or TSV comparison may be further refined or reordered by other known methods. For example, the data sets or web pages in the ordered list may be rearranged based on link information between the ranked web pages using an algorithm such as the PageRank algorithm developed by Google, which is described in U.S. Pat. No. 6285999, entitled "Method for node ranking in a linked database," the entire contents of which are incorporated herein by reference.

TSV structure

The structure of the TSVs of the data set will now be described. Further details of TSVs are described in U.S. patent No. 6751621 and US patent application serial No. 11/126184, the contents of which are previously incorporated by reference.

In preparation for generating TSVs for a data set, a semantic dictionary is applied to look up TSVs corresponding to data points contained in the data set. The semantic dictionary includes known relationships between a plurality of known data points and a plurality of predetermined categories. In other words, the semantic dictionary contains "definitions" of the corresponding words and phrases, e.g., TSVs.

An exemplary process for generating a TSV for a data set with a TSV generator is now described. The data set may be an advertisement, a web page, or any other type of data set. For purposes of illustration, words are used as examples of the keywords included in a document. It will be appreciated that many other types of data points or keywords may be included in a document, such as words, phrases, symbols, terms, hyperlinks, metadata information, graphics, and/or any displayed or non-displayed items, or any combination thereof.

Based on the input keywords of the document, the TSV generator identifies the corresponding keywords in the semantic dictionary and retrieves the respective TSV for each keyword contained in the document based on the definitions provided by the semantic dictionary. The TSV generator 103 generates the TSV of the document by combining the respective TSVs of the keywords contained within the document. For example, the TSV of the document may be defined as the vector sum of the respective TSVs of all keywords contained within the document.

The process of creating a semantic dictionary is now described. In one embodiment, the semantic dictionary is generated by determining, for each of a plurality of sample data sets, the one or more predetermined directories to which it belongs. A sample data set may belong to more than one predetermined directory, or it may be restricted to a single directory. For example, based on its content and on the predetermined directories, a news story about a patent infringement lawsuit involving a computer company may belong to directories including "intellectual property law", "trade disputes", "operating systems", "economic problems", and the like. Once it is determined that a sample data set is associated with one or more predetermined directories, all keywords contained in the sample data set are associated with the same predetermined directories. The same processing is performed for all sample data sets.

In one embodiment, the relationship between sample documents and directories may be determined by analyzing the Open Directory Project (ODP), in which expert editors assign thousands of web pages to a rich topic hierarchy. These sample web pages with assigned directories are used as training documents for determining the relationship of keywords to the predetermined directories. It should be clear to those skilled in the art that other online hierarchies, classification schemes, and ontologies can be applied in a similar way to associate sample training documents with directories.

The following steps describe how the ODP hierarchy is transformed for the purpose of generating a TSV semantic dictionary.

1. The ODP web pages are downloaded. An association is maintained between each web page and the ODP directory to which the web page belongs. Any incorrectly downloaded web pages are deleted, and URLs are converted to internal pathnames.

2. Optionally, all web pages referenced by any of the above-described ODP web pages are downloaded, and an association is created between each new web page and the ODP directory to which the source ODP web page belongs. Optionally, the new web pages are filtered to retain only those that have the same directory as the source ODP web page. Any incorrectly downloaded web pages are deleted, and URLs are converted to internal pathnames.

3. Optionally, undesired directories are deleted. Some types of ODP directories are deleted before processing. These may include empty directories (directories without corresponding documents), alphabetical listing directories (e.g., "movie titles" beginning with A, B, and so on, which carry no useful semantic distinction), other directories that do not contain useful information for identifying semantic content (e.g., regional pages written in an unexpected foreign language), and directories that contain misleading or inappropriate information (e.g., adult content pages).

4. Pages that are not suitable for training are deleted. In one embodiment, only pages with at least a minimum amount of content are used for training. In another embodiment, a training page must have at least 1000 bytes of converted text and at most 5000 whitespace-delimited words.

5. Optionally, any pages not written in English are deleted. This may be done by standard methods such as HTML meta-tags, automatic language detection, filtering URL domain names, filtering character ranges, or other techniques familiar to those skilled in the art.

6. Optionally, duplicates are deleted. If a page appears in more than one ODP directory, the page is ambiguously classified and may not be a good training candidate.

7. The dimensions of the candidate TSVs are identified. Running the compression-pruning (collapse-trim) algorithm described below automatically flattens the ODP hierarchy and identifies the dimensions of the candidate TSVs.

8. Optionally, the dimensions of the TSVs are adjusted. Based on the expected semantic properties of those dimensions, the dimensions of the automatically generated TSVs are examined and certain dimensions are manually compressed, separated, or deleted. Types of adjustments may include, but are not limited to, as described below. First, if certain words frequently appear in the source directory name, then those directories may be compressed to their parent (either because they all discuss the same thing or because they are semantically meaningless). Second, some special directories may be compressed to their parent (usually because they are too special). Third, certain directory groups that are independent within the ODP hierarchy may be merged together (e.g., "arts/magazines and E-presentations/E-presentations" may be merged into "arts/online works/E-presentations").

9. A TSV training file is created. Each potential training page is associated with the TSV dimension into which the page's directory was compressed. Pages are then selected from each TSV dimension to train that dimension, taking care not to over-train or under-sample. In one embodiment, we randomly select 300 pages with at least 1000 bytes of converted text (if there are fewer than 300 such pages, we select all of them). We delete any page that exceeds 5000 whitespace-delimited words, and we keep a maximum of 200000 whitespace-delimited words for the entire dimension, starting with the smallest page and stopping when the cumulative word count reaches 200000.

10. Optionally, the dimensions are relabeled. Each dimension starts with a label identical to the ontology path of the ODP directory from which the dimension originates. In one embodiment, certain labels are manually adjusted to shorten them, make them more readable, and ensure that they reflect the subdirectories that were merged or deleted. For example, the source label "top/shopping/vehicle/motorcycle/part and accessory/harley davidson" may be rewritten as "harley davidson, part and accessory".
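The page selection described in steps 4 and 9 can be sketched for a single TSV dimension as follows. This is only an illustration: the function names are hypothetical, pages are represented as plain strings of converted text, and stopping before the word budget would be exceeded is an assumption about how "stopping when the cumulative word count reaches 200000" is handled at the boundary.

```python
import random

def word_count(text):
    """Number of whitespace-delimited words."""
    return len(text.split())

def select_training_pages(pages, rng, max_pages=300, min_bytes=1000,
                          max_words_per_page=5000, max_words_total=200_000):
    """Select the training pages for one TSV dimension."""
    # Keep only pages with enough converted text.
    candidates = [p for p in pages if len(p.encode("utf-8")) >= min_bytes]
    # Randomly sample at most max_pages of them.
    if len(candidates) > max_pages:
        candidates = rng.sample(candidates, max_pages)
    # Drop overly long pages.
    candidates = [p for p in candidates if word_count(p) <= max_words_per_page]
    # Fill the word budget starting with the smallest page.
    selected, total = [], 0
    for page in sorted(candidates, key=word_count):
        if total + word_count(page) > max_words_total:
            break  # assumption: do not exceed the budget
        selected.append(page)
        total += word_count(page)
    return selected
```

Passing a seeded `random.Random` instance as `rng` keeps the selection reproducible between training runs.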

In one embodiment, the compression-pruning algorithm looks for the number of directly available pages at each directory node, bottom-up through the ODP hierarchy. If at least 100 pages are stored at that node, we reserve that node as the dimension of the TSV. Otherwise we compress it into the parent node.
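A minimal sketch of such a bottom-up pass is shown below. The nested-dictionary tree format and the directory paths are hypothetical, and counting the pages of a collapsed child toward its parent is an assumption about what "compress it into the parent node" entails.

```python
def collapse_trim(node, min_pages=100, dimensions=None):
    """Bottom-up pass over a directory tree.

    node: {"path": str, "pages": int, "children": [node, ...]}
    Returns (pages_passed_up_to_parent, dimensions), where dimensions maps
    each retained directory path to its page count.
    """
    if dimensions is None:
        dimensions = {}
    pages = node["pages"]
    for child in node.get("children", []):
        passed_up, _ = collapse_trim(child, min_pages, dimensions)
        pages += passed_up  # assumption: collapsed pages count toward the parent
    if pages >= min_pages:
        dimensions[node["path"]] = pages  # keep this node as a TSV dimension
        return 0, dimensions
    return pages, dimensions  # too small: collapse into the parent

# Hypothetical hierarchy.
tree = {
    "path": "Top/Shopping", "pages": 10,
    "children": [
        {"path": "Top/Shopping/Vehicles", "pages": 150, "children": []},
        {"path": "Top/Shopping/Toys", "pages": 95, "children": []},
    ],
}
leftover, dims = collapse_trim(tree)
```

In this example "Top/Shopping/Toys" is too small on its own, so its 95 pages are folded into "Top/Shopping", which then reaches the threshold with 105 pages.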

After the sample data sets have been assigned to the predetermined directories (dimensions), a data table is created, based on the assignment results, to store information indicating the relationship between the keywords contained in the sample data sets and the predetermined directories. Each entry of the data table establishes a relationship between a keyword and one of the predetermined directories. For example, each entry of the data table may correspond to the number of sample data sets within a directory that contain a particular keyword. The keywords correspond to the contents of the sample data sets, while the predetermined directories correspond to the dimensions of the semantic space. For use in constructing trainable semantic vectors, a semantic dictionary may be generated from the data table; it contains a "definition" for each word, term, or other keyword within the particular semantic space formed by the predetermined directories.
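Building such a data table can be sketched as follows, with each entry counting the sample data sets in a directory that contain a given keyword. The input format and the sample keywords and directory names are hypothetical.

```python
from collections import defaultdict

def build_data_table(samples):
    """samples: iterable of (keywords, directories) pairs, one per sample data set.

    Returns table[keyword][directory] = number of sample data sets assigned
    to that directory that contain the keyword.
    """
    table = defaultdict(lambda: defaultdict(int))
    for keywords, directories in samples:
        for keyword in set(keywords):  # count each data set once per keyword
            for directory in directories:
                table[keyword][directory] += 1
    return table

# Hypothetical sample data sets.
samples = [
    (["patent", "court"], ["law"]),
    (["patent", "software", "patent"], ["law", "computing"]),
    (["software"], ["computing"]),
]
table = build_data_table(samples)
```

Using `set(keywords)` reflects the document-count interpretation of an entry; a term-frequency table would drop that deduplication.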

FIG. 4 shows an exemplary data table for constructing a semantic dictionary. For simplicity and ease of understanding, the number of words and the number of predetermined categories in FIG. 4 are reduced to 5, which in practice may be thousands of terms or predetermined categories.

As shown in FIG. 4, table 200 contains rows 410 corresponding to the representative words W₁, W₂, W₃, W₄, and W₅, and columns 412 corresponding to the predetermined directories Cat₁, Cat₂, Cat₃, Cat₄, and Cat₅. Each entry of table 200 corresponds to the number of documents in the corresponding directory in which a particular word, e.g., one of the words W₁, W₂, W₃, W₄, and W₅, appears.

Summing across all columns 412 of a row 410 gives the total number of documents containing the word represented by that row; these totals are shown in column 416. Referring to FIG. 4, the word W₁ appears 20 times in directory Cat₂ and 8 times in directory Cat₅. Word W₁ does not appear in directories Cat₁, Cat₃, and Cat₄.

Referring to column 416, word W₁ appears a total of 28 times across all directories. In other words, 28 classified documents contain the word W₁. Examination of an exemplary column 412, e.g., Cat₁, reveals that word W₂ appears once in directory Cat₁, word W₃ appears 8 times in directory Cat₁, and word W₅ appears 2 times in directory Cat₁. Word W₁ does not appear in directory Cat₁ at all. Referring to row 418, the entry corresponding to directory Cat₁ indicates that 11 documents are classified in directory Cat₁.

According to one embodiment, after the data table is created, the meaning of each entry of the data table is determined. In some cases, the meaning of an entry may be considered to be the relative strength with which a word occurs in a particular directory, or the association of the word with that directory. However, such relationships should not be considered limiting. The meaning of each entry is limited only by the actual data sets and directories (e.g., the features that are considered important in representing and describing a directory). According to one embodiment of the invention, the meaning of each word is determined based on the statistical behavior of the word across all directories. This can be done by first calculating the percentage of keywords present in each directory, denoted u, according to the following formula:

$$u = \text{probability}(\text{entry}_n \mid \text{directory}_m) = \frac{\text{entry}_{n,m}}{\text{directory}_{m\_total}}$$

Next, the probability distribution of keyword occurrences across all directories, denoted v, is calculated according to the following formula:

$$v = \text{probability}(\text{directory}_m \mid \text{entry}_n) = \frac{\text{entry}_{n,m}}{\text{entry}_{n\_total}}$$

Here entry_{n,m} is the data table value for entry n in directory m, directory_{m_total} is the total for directory m, and entry_{n_total} is the total for entry n.

u and v both represent the strength of association of a word with a particular directory. For example, if a word only appears in a small number of datasets in one directory and not in any other directory, then it will have a high v value and a low u value for that directory. If an entry appears in both a large number of datasets in one directory and several other directories, then it will have a high u value and a low v value for that directory.
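The behavior of u and v can be sketched on a small table. The counts below are hypothetical (they are not the values of FIG. 4), and the directory and word totals are assumed to be the column and row sums of the data table.

```python
def u_and_v(counts):
    """counts[word][directory] = number of data sets in the directory containing the word.

    Returns (u, v) with u[word][directory] = count / directory total and
    v[word][directory] = count / word total.
    """
    words = list(counts)
    dirs = sorted({d for row in counts.values() for d in row})
    dir_total = {d: sum(counts[w].get(d, 0) for w in words) for d in dirs}
    word_total = {w: sum(counts[w].values()) for w in words}
    u = {w: {d: (counts[w].get(d, 0) / dir_total[d]) if dir_total[d] else 0.0
             for d in dirs} for w in words}
    v = {w: {d: (counts[w].get(d, 0) / word_total[w]) if word_total[w] else 0.0
             for d in dirs} for w in words}
    return u, v

# Hypothetical counts: W_a is rare but occurs only in C1; W_b is common everywhere.
counts = {"W_a": {"C1": 2, "C2": 0}, "W_b": {"C1": 6, "C2": 4}}
u, v = u_and_v(counts)
```

Here W_a has a low u (0.25) but a high v (1.0) for C1, the first situation described above, while W_b has a higher u (0.75) and a lower v (0.6) for the same directory.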

Depending on the amount and type of information being represented, additional data manipulation may be performed to refine the meaning determined for each word. For example, the u values of a keyword may be normalized (i.e., divided) by the sum of all u values for that keyword, thus allowing them to be interpreted as a probability distribution.

The weighted mean of u and v can be applied to determine the meaning of the keyword according to the following formula:

α(v)+(1-α)u

The variable α is a weighting factor that can be determined based on the information being represented and analyzed. According to one embodiment of the invention, the weighting factor has a value of approximately 0.75. Other values may be selected depending on various factors, such as the type and amount of information, or the level of detail necessary to represent the information. Based on evidence gathered from experiments, the inventors have determined that the weighted mean of the u and v vectors can yield better results than those achieved with u alone, v alone, or an unweighted combination of u and v.
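The weighted mean can be sketched as follows, using the approximate value α = 0.75 mentioned above as the default. The function names and the sample u and v values are illustrative only.

```python
def tsv_entry(u, v, alpha=0.75):
    """Weighted mean alpha * v + (1 - alpha) * u for one word and one directory."""
    return alpha * v + (1 - alpha) * u

def tsv_for_word(u_row, v_row, alpha=0.75):
    """TSV of a word: one weighted entry per directory (dimension)."""
    return [tsv_entry(u, v, alpha) for u, v in zip(u_row, v_row)]

# Hypothetical u and v values of one word over two directories.
example_tsv = tsv_for_word([0.25, 0.0], [1.0, 0.0])
```

With α = 0.75 the v component dominates, so a word concentrated in a single directory receives a strong entry there even when its u value is small.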

FIG. 5 illustrates the operations described above, based on the data of FIG. 4. In FIG. 5, table 230 stores values indicating the relative strength of each word with respect to each directory. In particular, the percentage of keywords appearing in each directory (i.e., u) is presented in the form of a vector for each word. The value of each entry in the u vector is calculated according to the following formula:

$$u = \text{probability}(\text{word}_n \mid \text{directory}_m) = \frac{\text{word}_{n,m}}{\text{directory}_{m\_total}}$$

Table 230 also presents the probability distribution (i.e., v) of keyword occurrences across all directories in the form of a vector for each word. The value of each entry in the v vector is calculated according to the following formula:

$$v = \text{probability}(\text{directory}_m \mid \text{word}_n) = \frac{\text{word}_{n,m}}{\text{word}_{n\_total}}$$

Turning now to FIG. 6, table 250 illustrates the semantic representations, or "definitions", of the words in FIG. 4. Table 250 is a combination of 5 TSVs corresponding to the semantic representation of each word across the semantic space. For example, the first row corresponds to the TSV of word W₁. Each TSV has dimensions corresponding to the predetermined directories. Additionally, according to one embodiment of the invention, the TSVs of words W₁, W₂, W₃, W₄, and W₅ are computed such that the entries are adjusted to optimize the meaning of each word with respect to a particular directory. More specifically, each value is calculated by the following formula:

α(v)+(1-α)u

The entries of each TSV are calculated based on the actual values stored in table 230. Thus, the TSVs shown in table 250 correspond to the "definitions" of the exemplary words W₁, W₂, W₃, W₄, and W₅ represented in FIG. 4 with respect to each predetermined directory, or vector dimension. Together, these TSVs form a semantic dictionary for the vector space formed by the predetermined directories.

It is sometimes desirable to place advertisements in documents that are local to the market for the advertised product. This may be done by embedding geographic information (e.g., zip code, city/country name) in the advertisement, or by obtaining the IP address of the user and associating it with a geographic region. However, not all documents contain geographic information in a suitable form, and not all users have an IP address corresponding to their local area. In such cases, additional directories associated with geographic regions may be included in the predetermined directories during formation of the semantic dictionary as described above. Each geographic region becomes a dimension within the semantic space, and sample data sets tagged with geographic information are used to create the semantic dictionary. The semantic dictionary may then be used to generate TSVs for data sets and advertisements that reflect the strength with which those data sets and advertisements are associated with different geographic regions.

The application of TSVs is not limited to one language. Once a suitable sample data set is available, semantic dictionaries can be created for different languages. For example, the English sample data set from the Open Directory Project may be replaced with an appropriate sample data set in another language when generating a semantic dictionary. There may be a separate semantic dictionary for each language. Alternatively, keywords for all languages may be grouped in a single common dictionary. Different languages may share the same predetermined directories or semantic dimensions, or may have entirely different predetermined directories or semantic dimensions, depending on whether they share the same semantic dictionary and whether it is desired to compare semantic vectors across languages.

After the semantic dictionary is created, the TSV generator 103 may access the semantic dictionary to find the corresponding TSVs for the keywords contained in a target document. In one embodiment, the TSVs of the keywords included in the target document are combined to generate the TSV of the target document. The manner in which the TSVs are combined depends on the particular implementation. For example, the TSVs may be combined by vector addition. In this case, the TSV of the document may be represented by the following formula:

$$\text{TSV}(\text{document}) = \text{TSV}(W_1) + \text{TSV}(W_2) + \text{TSV}(W_3) + \dots + \text{TSV}(W_n)$$

Here W1, W2, W3, ..., Wn are the keywords contained in the document.
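The vector-addition case can be sketched as follows. The dictionary contents are hypothetical, and skipping keywords that have no dictionary entry is an assumption made for illustration.

```python
def document_tsv(keywords, semantic_dictionary):
    """Sum the TSVs of all keywords of a document that appear in the dictionary."""
    dims = len(next(iter(semantic_dictionary.values()), []))
    total = [0.0] * dims
    for keyword in keywords:
        tsv = semantic_dictionary.get(keyword)
        if tsv is None:
            continue  # assumption: keywords without a definition are ignored
        total = [t + x for t, x in zip(total, tsv)]
    return total

# Hypothetical two-dimensional semantic dictionary.
semantic_dictionary = {"bank": [0.5, 0.25], "loan": [1.0, 0.0]}
doc_vector = document_tsv(["bank", "loan", "unknown"], semantic_dictionary)
```

A keyword that occurs several times in the input list contributes its TSV once per occurrence, so frequent keywords weigh more heavily in the document TSV.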

The generation of a TSV for a data set may utilize multiple types of information, including keywords of the data set, information retrieved based on keywords included in the advertisement or data set, and additional information assigned to the data set. For example, the TSV of an advertisement may be generated based on information including, but not limited to: words displayed in the advertisement; a set of keywords associated with the advertisement; the advertisement title; a summary description of the advertisement; marketing text associated with the advertisement that describes the item being advertised or the audience to whom it is being sold; and information from websites referenced by the advertisement. The TSV of a web page may be generated based on information including, but not limited to: some or all of the actual text that appears on the web page; meta-text fields associated with the web page, such as title, keywords, and description; text of links to or from other web pages; and the like.

To increase operating speed, the TSVs for advertisements may be generated and updated offline as advertisements are modified, added, or deleted. Alternatively, TSVs may be generated at the time of advertisement placement. Similarly, TSVs for web pages or other data sets may be generated offline or on the fly.

According to one embodiment, an exemplary system disclosed herein analyzes portions of a data set, such as a web page or a displayed document, and, based on a final matching list of background articles, automatically links one or more descriptions of each portion to a set of background articles, such as encyclopedia articles from Wikipedia (http://www.wikipedia.com).

It will be appreciated by those skilled in the art that the methods and systems disclosed herein may be applied for various purposes, such as associating one or more advertisements with one or more web pages or documents, or vice versa; retrieving relevant documents based on a user search question; and finding background information and equivalent terms for different parts of a data set. It will also be appreciated that a data set as used herein may include only a single type of data set, such as a web page or document, or a combination of different types of data sets, such as a combination of e-mails and web pages, or documents and broadcast data.

According to another embodiment of the invention, data sets, such as the advertisements 12 and the web page 11, are represented and indexed using an improved representation called a "tag key". A tag key associates a keyword included in a data set with a particular semantic category applicable to one or more data sets. For example, the term "bank" may have a variety of different meanings, but when it is tagged with a semantic category such as financial institutions, it will no longer match a "bank" tagged with a semantic category such as geological structures.

When analyzing a data set, such as the web page 11 or an advertisement 12, as previously discussed with respect to FIG. 3, the keyword selector 115 or 106 selects candidate keywords from each advertisement or the web page 11 that are considered representative of the advertisement or web page. In one embodiment, candidate keywords may be selected based on the frequency with which each keyword appears in a particular data set or document. According to an exemplary system of the present invention, a semantic dictionary is accessed to obtain information on the predetermined semantic directories and their relationships with the candidate keywords. For example, for a data set with N candidate keywords and M predetermined directories, up to M×N keyword and directory pairs (potential tag keys) are available. Filters may be used to remove directories that are only weakly associated with a keyword. A threshold specifying a minimum association requirement may be used to identify the directories sufficiently associated with a keyword. An exemplary method of selecting a directory for a keyword is simply to look up the keyword in the semantic dictionary discussed above, which includes information on how strongly a particular term selects a given semantic directory. In one embodiment, the directory or set of directories most strongly selected by a keyword is the primary candidate for a tag.

For example, assume that a document contains two keywords, K1 and K2. K1 and K2 are looked up in the semantic dictionary to see which directories, if any, are associated with each keyword. If a keyword is associated with more than one directory, such as directories C1, C2, C3, and C4, there are several options: (1) select the directory with the strongest association with the keyword; (2) select all directories whose association exceeds a minimum threshold; or (3) select all directories regardless of the strength of association. The result is a list of keyword and directory pairs, i.e., tag keys, such as K1+C1, K1+C2, and K2+C4, that represent the data set. Each tag key may be considered a semantic vector corresponding to a keyword, and the semantic vectors of the candidate keywords may be combined, e.g., by vector addition, to form a semantic vector representation of the data set. The semantic vector representation may be used in a manner similar to those already described herein.
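The three selection options can be sketched as follows. The function name, the mode names, and the association strengths are hypothetical; the dictionary is assumed to map each keyword to its per-directory association strengths.

```python
def tag_keys(keywords, semantic_dictionary, mode="strongest", threshold=0.0):
    """Return (keyword, directory) pairs, i.e. tag keys, for a data set.

    mode: "strongest" (option 1), "threshold" (option 2), or "all" (option 3).
    """
    pairs = []
    for keyword in keywords:
        # Only directories with a positive association strength are considered.
        strengths = {d: s for d, s in semantic_dictionary.get(keyword, {}).items()
                     if s > 0}
        if not strengths:
            continue  # keyword is not associated with any directory
        if mode == "strongest":
            chosen = [max(strengths, key=strengths.get)]
        elif mode == "threshold":
            chosen = [d for d, s in strengths.items() if s >= threshold]
        elif mode == "all":
            chosen = list(strengths)
        else:
            raise ValueError(f"unknown mode: {mode}")
        pairs.extend((keyword, d) for d in chosen)
    return pairs

# Hypothetical association strengths.
semantic_dictionary = {
    "K1": {"C1": 0.7, "C2": 0.4, "C3": 0.1},
    "K2": {"C4": 0.9},
}
strongest = tag_keys(["K1", "K2"], semantic_dictionary)
above_threshold = tag_keys(["K1", "K2"], semantic_dictionary,
                           mode="threshold", threshold=0.3)
```

With a threshold of 0.3 the result is K1+C1, K1+C2, and K2+C4, which mirrors the example list of tag keys given above.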

FIG. 7 is a block diagram representing a computer system 100 upon which an exemplary system of the present invention may be implemented. Computer system 100 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled with bus 702 for processing information. Computer system 100 also includes a main memory 706, such as a Random Access Memory (RAM) or other dynamic storage device, coupled to bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions by processor 704. Computer system 100 further includes a Read Only Memory (ROM)708 or other static storage device coupled to bus 702 for storing static information and instructions for processor 704. A storage device 710, such as a magnetic disk or optical disk, is provided and coupled to the bus for storing information and instructions.

Computer system 100 may be coupled via bus 702 to a display 712, such as a Cathode Ray Tube (CRT), for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to bus 702 for communicating information and command selections to processor 704. Another type of user input device is cursor control 716, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 704 and for controlling cursor movement on display 712. Such an input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.

According to one embodiment of the invention, computer system 100 provides for the construction and semantic operation of TSVs in response to one or more sequences of one or more instructions being executed by processor 704 and/or received from main memory 706, storage device 710, or network link 120. Such instructions may be read into main memory 706 from another computer-readable medium, such as storage device 710. Execution of the sequences of instructions contained in main memory 706 causes processor 704 to perform the process steps described herein. The sequences of instructions contained in main memory 706 may also be executed in a multi-processing arrangement using one or more processors. In an alternative embodiment, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 704 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 710. Volatile media include dynamic memory, such as main memory 706. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise bus 702. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 704 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 100 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector coupled to bus 702 can receive the data carried in the infrared signal and place the data on bus 702. Bus 702 carries the data to main memory 706, from which processor 704 retrieves and executes the instructions. The instructions received by main memory 706 may optionally be stored on storage device 710 either before or after execution by processor 704.

Computer system 100 also includes a communication interface 718 coupled to bus 702. Communication interface 718 provides a two-way data communication coupling to network link 120 that is connected to a local network 722. For example, communication interface 718 may be an Integrated Services Digital Network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 718 may be a Local Area Network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 718 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 120 typically provides data communication through one or more networks to other data devices. For example, network link 120 may provide a connection through local network 722 to a host computer 724 or to data equipment operated by an Internet Service Provider (ISP) 726. ISP 726 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 728. Local network 722 and Internet 728 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 120 and through communication interface 718, which carry the digital data to and from computer system 100, are exemplary forms of carrier waves transporting the information.

Computer system 100 can send messages and receive data, including program code, through the network(s), network link 120 and communication interface 718. In the Internet example, a server 130 might transmit a requested code for an application program through Internet 728, ISP 726, local network 722 and communication interface 718.

In accordance with the present invention, one such downloaded application provides for the creation of TSVs and the performance of various semantic operations as described herein. The received code may be executed by processor 704 as it is received, and/or stored in storage device 710 or other non-volatile storage for later execution. In this manner, computer system 100 may obtain application code in the form of a carrier wave.

In the previous descriptions, numerous specific details are set forth, such as materials, structures, processes, etc., in order to provide a thorough understanding of the present invention. However, as will be recognized by one of ordinary skill in the art, the present invention may be practiced without these specific details. In other instances, well-known processing structures have not been described in detail in order not to unnecessarily obscure the present invention.

Only exemplary embodiments of the present invention, and but a few examples of its versatility, are shown and described in the present disclosure. It is to be understood that the invention is capable of use in other combinations and environments and is capable of changes or modifications within the scope of the inventive concept as expressed herein.

Claims

1. A machine-implemented method for controlling a data processing system for associating at least one dataset of a set of datasets to a subject dataset, wherein each dataset or the subject dataset comprises at least one keyword, the method comprising the machine-implemented steps of:

obtaining a semantic vector representing the subject data set and a respective semantic vector representing each individual data set in the group, wherein:

each semantic vector representing an individual dataset in the group comprises aggregate information on the relationship between each of the at least one keyword of the individual dataset and the predetermined categories to which that keyword may be associated;

the semantic vector representing the subject data set comprises aggregate information on the relationship between each of the at least one keyword of the subject data set and the predetermined categories to which that keyword may be associated, and

the semantic vector representing the subject data set or each individual data set in the group has a number of dimensions equal to the number of predetermined categories;

for each data set in the group, determining a first similarity between the subject data set and that data set by comparing the semantic vector associated with the subject data set with the semantic vector associated with that data set;

obtaining a keyword semantic representation of the subject dataset and a keyword semantic representation of each individual dataset in the group, wherein:

the keyword semantic representation of the subject data set or the keyword semantic representation of each individual data set of the group comprises information indicating representative keywords of the subject data set or of the individual data sets of the group, and

the keyword semantic representation of the subject data set or the keyword semantic representation of each individual data set of the group is structured in a different way than the semantic vector of the subject data set or the semantic vector of each individual data set of the group;

for each dataset in the group, determining a second similarity between the subject dataset and that dataset by comparing the keyword semantic representation of the subject dataset to the keyword semantic representation of that dataset;

selecting at least one of the datasets in the group based on the first similarity between the subject dataset and each dataset in the group and the second similarity between the subject dataset and each dataset in the group; and

associating the at least one selected data set to the subject data set.
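
As an informal illustration of the two comparisons recited in claim 1, the sketch below computes a first similarity from category-dimension semantic vectors and a second similarity from representative keywords. The choice of cosine similarity and Jaccard overlap, the `DataSet` class, and all field and function names are assumptions for the example; the claim does not prescribe any particular measure.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class DataSet:
    """Hypothetical container: one vector dimension per predetermined category."""
    name: str
    semantic_vector: Sequence[float]
    keywords: FrozenSet[str]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed measure for the first similarity (semantic vectors)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed measure for the second similarity (representative keywords)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def compare(subject: DataSet, other: DataSet) -> Tuple[float, float]:
    """Return (first similarity, second similarity) for one data set in the group."""
    first = cosine_similarity(subject.semantic_vector, other.semantic_vector)
    second = jaccard_similarity(subject.keywords, other.keywords)
    return first, second


if __name__ == "__main__":
    page = DataSet("page", [1.0, 0.0, 1.0], frozenset({"car", "engine"}))
    ad = DataSet("ad", [1.0, 0.0, 0.0], frozenset({"car", "loan"}))
    print(compare(page, ad))
```

The two scores are kept separate here so that a later selection step can combine them in whichever way is preferred.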

2. The method of claim 1, wherein at least one of the datasets in the group is an advertisement and the subject dataset is a document, a web page, an email, an RSS news feed, a data stream, broadcast data, or information related to a user; or a portion of one or more documents, web pages, emails, RSS news feeds, data streams, broadcast data, or information related to the user, or a combination thereof.

3. The method of claim 1, wherein the subject data set is part of a document, a web page, an email, an RSS news feed, a data stream, broadcast data, or information related to a user.

4. The method of claim 1 further comprising the steps of: transmitting the at least one selected data set or the file associated with the selected data set and the subject data set or the file associated with the subject data set to a user.

5. The method of claim 4, wherein the at least one selected data set is communicated to a user by displaying the at least one selected data set, playing a voice signal according to the at least one selected data set, or providing a link to the at least one selected data set.

6. The method of claim 1, wherein the at least one keyword comprises at least one of a word, a phrase, a string, a pre-assigned keyword, a sub data set, meta information, and information retrieved based on a link contained in the individual data set.

7. The method of claim 1, wherein the semantic vector for each data set is pre-computed and contained in the individual data set.

8. The method of claim 1, wherein the semantic vector is dynamically generated.

9. The method of claim 1, wherein the semantic vector representing each individual dataset in the group is constructed based on at least one keyword of each individual dataset in the group and a known relationship between a known keyword and a predetermined directory to which the known keyword is likely to be associated, and the semantic vector representing a subject dataset is constructed based on at least one keyword of the subject dataset and the known relationship between a known keyword and a predetermined directory to which the known keyword is likely to be associated.

10. The method of claim 1, wherein the semantic vector associated with the individual data set is generated further based on information related to at least one user or at least one data set linked to the individual data set.

11. The method of claim 10, wherein the information associated with the at least one user includes at least one of previously viewed documents, previous search requests, user preferences, and personal information.

12. The method of claim 1, wherein the step of selecting at least one of the datasets in the group based on the first similarity between the subject dataset and each dataset in the group and the second similarity between the subject dataset and each dataset in the group comprises: designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity; obtaining information of a plurality of preset relevance levels for the primary similarity; for each data set in the group, mapping the data set to one of the preset relevance levels according to its primary similarity; sorting the data sets in the group according to the preset relevance levels to which they are respectively mapped; ranking, within each relevance level, the data sets in that relevance level according to their secondary similarity; and selecting at least one of the data sets in the group according to the result of ranking the data sets within each relevance level.
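
One possible reading of the level-based selection in claim 12 is sketched below. The level boundaries, the tuple layout `(identifier, primary, secondary)`, and the function names are assumptions chosen for the example, not values taken from the specification.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidate layout: (identifier, primary similarity, secondary similarity).
Candidate = Tuple[str, float, float]


def relevance_level(primary: float, boundaries: Sequence[float]) -> int:
    """Map a primary similarity to a preset level.

    The level is the number of boundaries the score meets or exceeds,
    so a higher level means a more relevant data set.
    """
    return sum(primary >= boundary for boundary in boundaries)


def select_by_levels(
    candidates: List[Candidate],
    boundaries: Sequence[float] = (0.3, 0.6),
    top_n: int = 2,
) -> List[str]:
    """Sort by relevance level, then by secondary similarity within a level."""
    ordered = sorted(
        candidates,
        key=lambda c: (relevance_level(c[1], boundaries), c[2]),
        reverse=True,
    )
    return [identifier for identifier, _, _ in ordered[:top_n]]


if __name__ == "__main__":
    group = [("a", 0.9, 0.1), ("b", 0.7, 0.8), ("c", 0.5, 0.9), ("d", 0.1, 1.0)]
    print(select_by_levels(group, top_n=3))
```

In this sketch, data sets "a" and "b" fall in the same top level, so the secondary similarity decides their order; "d" has the best secondary score but stays last because its primary similarity maps to the lowest level.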

13. The method of claim 1, wherein the step of selecting at least one of the datasets in accordance with a first similarity between the subject dataset and each dataset in the group and in accordance with a second similarity between the subject dataset and each dataset in the group comprises: designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity; sorting the data sets in the group according to the primary similarity; selecting at least one candidate data set from the sorted data sets according to a preset criterion; ranking the at least one candidate data set according to the secondary similarity; and selecting the at least one of the data sets in the group according to the result of ranking the at least one candidate data set.
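
The shortlist-then-rerank selection of claim 13 might look like the following sketch. The preset criterion is assumed here to be "keep the top `shortlist_size` data sets by primary similarity"; that criterion, the tuple layout, and the function name are illustrative assumptions only.

```python
from typing import List, Tuple

# Hypothetical candidate layout: (identifier, primary similarity, secondary similarity).
Candidate = Tuple[str, float, float]


def two_stage_select(
    candidates: List[Candidate],
    shortlist_size: int = 3,
    top_n: int = 1,
) -> List[str]:
    """Shortlist by primary similarity, then rank the shortlist by secondary similarity."""
    # Stage 1: sort the whole group by primary similarity, best first.
    by_primary = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Assumed preset criterion: keep a fixed number of top candidates.
    shortlist = by_primary[:shortlist_size]
    # Stage 2: rank only the shortlisted candidates by secondary similarity.
    reranked = sorted(shortlist, key=lambda c: c[2], reverse=True)
    return [identifier for identifier, _, _ in reranked[:top_n]]


if __name__ == "__main__":
    group = [("a", 0.9, 0.1), ("b", 0.7, 0.8), ("c", 0.5, 0.9), ("d", 0.1, 1.0)]
    print(two_stage_select(group, shortlist_size=3, top_n=2))
```

Unlike the level-based approach, a data set that misses the shortlist (here "d") can never be selected, however high its secondary similarity.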

14. The method of claim 1, wherein the step of selecting at least one of the datasets in accordance with a first similarity between the subject dataset and each dataset in the group and in accordance with a second similarity between the subject dataset and each dataset in the group comprises: for each data set in the group, calculating a composite similarity based on the respective first similarity and the respective second similarity of the data set according to a preset formula; and selecting the at least one of the data sets in the group according to the respective composite similarities of the data sets based on preset criteria.
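
For the composite similarity of claim 14, the preset formula is not specified. The sketch below assumes a simple weighted sum and a minimum-score criterion purely for illustration; the weight, the threshold, and the function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical candidate layout: (identifier, first similarity, second similarity).
Candidate = Tuple[str, float, float]


def composite_similarity(first: float, second: float, weight: float = 0.5) -> float:
    """Assumed preset formula: weighted sum of the two similarities."""
    return weight * first + (1.0 - weight) * second


def select_by_composite(
    candidates: List[Candidate],
    weight: float = 0.5,
    min_score: float = 0.6,
) -> List[str]:
    """Return identifiers whose composite similarity reaches min_score, best first."""
    scored = [
        (identifier, composite_similarity(first, second, weight))
        for identifier, first, second in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [identifier for identifier, score in scored if score >= min_score]


if __name__ == "__main__":
    group = [("a", 0.9, 0.1), ("b", 0.7, 0.8), ("c", 0.5, 0.9), ("d", 0.1, 1.0)]
    print(select_by_composite(group))
```

Changing `weight` shifts the balance between the semantic-vector comparison and the keyword comparison without altering the rest of the selection logic.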

15. The method of claim 1, further comprising providing the at least one of the data sets to a user simultaneously with the subject data set.

16. The method of claim 1, further comprising providing the at least one of the data sets to a user after providing the subject data set to the user.

17. The method of claim 1, wherein the at least one of the data sets or the subject data set is provided to a user in audio form, visual form, video form, tactile form, or any combination thereof.

18. A data processing system for associating at least one data set of a set of data sets to a subject data set, wherein each data set or the subject data set contains at least one keyword, the system comprising:

a data processor configured to process data; and

a data storage system configured to store instructions, said instructions being executable by said data processor, said system controlling said data processor to perform the steps of:

each semantic vector representing each individual dataset in the group comprises aggregate information having a relationship between each of the at least one keyword in the individual dataset and a predetermined directory, the directory to which each of the at least one keyword of the individual dataset may be associated;

the semantic vector representing the subject data set comprises set information having a relationship between each of the at least one keyword of the subject data set and a predetermined category to which each of the at least one keyword of the subject data set may be associated, and

a semantic vector representing the subject dataset or each of the individual datasets in the group has a dimension equal to the number of predetermined categories;

the keyword semantic representation of the subject data set or the keyword semantic representation of each individual data set of the group comprises information indicating representative keywords of the subject data set or the individual data sets of the group, and

associating the at least one selected data set to the subject data set.

19. A machine-readable medium carrying instructions which are executed by a data processing system, the machine-readable medium controlling the data processing system to perform machine-implemented steps to associate at least one dataset from a set of datasets to a subject dataset, wherein each dataset or the subject dataset contains at least one keyword, the steps comprising:

each semantic vector representing each individual data set in the group comprises aggregate information having a relationship between each of the at least one keyword of the individual data set and a predetermined directory to which each of the at least one keyword of the individual data set may be associated;

a semantic vector representing the subject dataset or each individual dataset in the group has dimensions equal to the number of predetermined categories;

associating the at least one selected data set to the subject data set.

20. A machine-implemented method for controlling a data processing system for associating at least one dataset of a set of datasets to a subject dataset, wherein each dataset or the subject dataset comprises at least one keyword, the method comprising the machine-implemented steps of:

each semantic vector representing each individual data set in the group comprising aggregate information having a relationship between each of the at least one keyword of the individual data set and a predetermined directory to which each of the at least one keyword of the individual data set may be associated,

a semantic vector representing the subject data set or each individual data set in the group has a number of dimensions equal to the number of predetermined categories;

for each data set, generating a joint vector representation of the data set from the semantic vector associated to each data set and the keyword semantic representation of each data set;

for the subject data set, generating a joint vector representation of the subject data set from the semantic vector associated to the subject data set and the keyword semantic representation of the subject data set;

determining a similarity between the subject dataset and each dataset in the group by comparing the joint vector representation of the subject dataset to the joint vector representation of each dataset in the group; and

selecting at least one of the data sets in the group according to the determined similarity; and

associating at least one selected one of the data sets in the group to the subject data set.
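
The joint vector representation of claim 20 could, under one set of assumptions, be formed by appending a keyword indicator vector to the semantic vector and comparing the results with cosine similarity. The shared vocabulary, the binary keyword encoding, the similarity measure, and the function names below are all illustrative assumptions.

```python
import math
from typing import Iterable, List, Sequence


def keyword_vector(keywords: Iterable[str], vocabulary: Sequence[str]) -> List[float]:
    """Binary indicator vector over an assumed shared vocabulary."""
    present = set(keywords)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def joint_vector(
    semantic_vector: Sequence[float],
    keywords: Iterable[str],
    vocabulary: Sequence[str],
) -> List[float]:
    """Append the keyword vector to the semantic vector."""
    return list(semantic_vector) + keyword_vector(keywords, vocabulary)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    vocabulary = ["car", "loan", "engine"]
    subject = joint_vector([1.0, 0.0], ["car", "engine"], vocabulary)
    ad_1 = joint_vector([1.0, 0.0], ["car"], vocabulary)
    ad_2 = joint_vector([0.0, 1.0], ["loan"], vocabulary)
    print(cosine(subject, ad_1), cosine(subject, ad_2))
```

A single comparison of joint vectors replaces the separate first and second similarities, at the cost of fixing their relative weighting through the vector layout.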

21. A machine-readable medium carrying instructions for execution by a data processing system, the machine-readable medium controlling the data processing system to perform machine-implemented steps to associate at least one dataset from a set of datasets to a subject dataset, wherein each dataset or the subject dataset contains at least one keyword, the steps comprising:

22. A machine-implemented method for controlling a data processing system for associating at least one dataset of a set of datasets to a subject dataset, wherein each dataset or the subject dataset comprises at least one keyword, the method comprising the machine-implemented steps of:

obtaining a tag key representation representing the subject data set and a respective tag key representation representing each individual data set in the group, wherein:

each tag key representation representing an individual data set in the group includes aggregate information on the relationship between each of the representative keywords of the individual data set and a predetermined catalog to which that representative keyword may be associated;

the tag key representation representing the subject data set includes aggregate information on the relationship between each of the representative keywords of the subject data set and a predetermined catalog to which that representative keyword may be associated,

for each data set in the group, determining a level of similarity between the subject data set and each data set in the group by comparing the tag key representation associated with the subject data set with the tag key representation associated with each data set in the group;

selecting at least one of the data sets in the group according to the determined level of similarity between the subject data set and each data set in the group; and

23. A machine-readable medium carrying instructions for execution by the data processing system which control the data processing system to perform machine-implemented steps for associating at least one dataset from a set of datasets to a subject dataset, wherein each dataset or the subject dataset contains at least one keyword, the steps comprising:

each of the tag key representations representing each individual data set in the group includes aggregate information on the relationship between each of the representative keywords of each individual data set and a predetermined catalog with which that representative keyword is associated;

24. A machine-implemented method for controlling a data processing system to generate a tag key representation of a data set containing at least one keyword, the method comprising:

identifying a representative keyword from the at least one keyword for representing the data set;

obtaining data identifying known relationships between each known keyword and a predetermined directory;

determining a relationship between each representative keyword and the predetermined directory by referring to the acquired data;

creating a tag key representation of the data set according to the relationship between each representative keyword and the predetermined directory; and

representing the data set using the created tag key representation.

25. A machine-readable medium carrying instructions for execution by a data processing system, the machine-readable medium controlling the data processing system to perform machine-implemented steps for associating at least one dataset from a set of datasets to a subject dataset, wherein each dataset or the subject dataset contains at least one keyword, the steps comprising:

representing the data set using the created tag key representation.