WO2008036960A2 - Identification de contenus et collaboration de type pair-à-pair - Google Patents
Identification de contenus et collaboration de type pair-à-pair Download PDFInfo
- Publication number
- WO2008036960A2 WO2008036960A2 PCT/US2007/079257 US2007079257W WO2008036960A2 WO 2008036960 A2 WO2008036960 A2 WO 2008036960A2 US 2007079257 W US2007079257 W US 2007079257W WO 2008036960 A2 WO2008036960 A2 WO 2008036960A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- content
- computer
- vector
- content objects
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- This invention relates to the field of online services and, in particular, to peer-to- peer collaboration and sharing interests.
- the system is a system for indexing content.
- An embodiment of the system includes a crawler, a crawl database, an indexer, a classification application, and an indexed data server.
- the crawler is configured to crawl the internet for content objects.
- the crawl database is coupled to the crawler and configured to cache the content objects.
- the indexer is coupled to the crawl database and configured to perform feature extraction on the content objects and cluster the content objects by generating an object vector.
- the classification application is coupled to the indexer and configured to cluster the object vectors and generate a summary vector.
- the indexed data server is coupled to the indexer and configured to communicate the content objects with a client.
- Other embodiments of the system are also described.
- the computer program product includes a computer useable storage medium to store a computer readable program that, when executed on a computer, causes the computer to perform one or more operations.
- the operations include an operation to crawl the internet for content objects, an operation to cache the content objects, an operation to perform feature extraction on the content objects and cluster the content objects by generating an object vector, an operation to cluster the object vectors and generate a summary vector, and an operation to communicate the content objects with a client.
- Other embodiments of the computer program product are also described.
- the method is a method for indexing content.
- An embodiment of the method includes crawling the internet for content objects, caching the content objects, performing feature extraction on the content objects and cluster the content objects by generating an object vector, clustering the object vectors and generate a summary vector, and communicating the content objects with a client.
- Other embodiments of the method are also described.
- Figure 1 illustrates one embodiment of an online system.
- Figure 2 illustrates one embodiment of a client of the online system of Figure 1.
- Figure 3 illustrates one embodiment of a web browser with a user interface for the online system of Figure 1.
- Figure 4 illustrates another embodiment of the web browser and user interface of
- Figure 5 illustrates another embodiment of the web browser and user interface of
- Figure 6 illustrates another embodiment of the web browser and user interface of
- Figure 7 illustrates another embodiment of the web browser and user interface of
- Figure 8 illustrates a schematic flow chart diagram of one embodiment of a content search algorithm.
- Figure 9 illustrates a schematic flow chart diagram of another embodiment of a content search algorithm for searching a static domain.
- FIG. 10 illustrates a schematic flow chart diagram of another embodiment of a content search algorithm for searching a dynamic domain using really simple syndication (RSS) feeds.
- RSS really simple syndication
- Figures 11-23 illustrate embodiments of hierarchical classification algorithms that may be implemented in the online system of Figure 1.
- Figure 24 illustrates another embodiment of the client of Figure 2. DETAILED DESCRIPTION
- FIG. 1 illustrates one embodiment of an online system 100.
- the depicted online system 100 uses the internet 112 to facilitate communications among the various components. However, in other embodiments, communications among the many components, or among some of the components, also may occur over one or more networks such as a local area network (LAN), a wide area network (WAN), a wireless network (WiFi), or other types of conventional networks. Alternatively, one or more components within the online system 100 may be coupled directly to another component of the online system 100.
- the illustrated online system 100 includes a crawler 114, a crawl database 116, and an indexer 118. In one embodiment, the crawler 114 searches the internet 112 for one or more types of content.
- the crawler 114 may search the internet 112 for static websites and for dynamic websites, which include dynamic content such as real simple syndication (RSS) feeds.
- the crawler 114 is implemented using a third party server (or server farm).
- Copies of the content may be stored or cached on the crawl database 116.
- the data objects e.g., websites, news items, etc.
- the crawl database 116 may contain a substantial amount of data (e.g., 100 terabytes).
- the indexer 118 is coupled to the crawl database 116, in one embodiment via the internet 112, to perform operations on the data stored in the crawl database 116.
- the indexer 118 may perform feature extraction on the data in the crawl database 116.
- a more detailed explanation of feature extraction is provided below in reference to the client.
- the indexer 118 may perform other operations to manipulate the data on the crawl database 116.
- the indexer 118 after feature extraction, may pre-process the data with various forms of scaling.
- the indexer 118 may apply the Term Frequency (TF), Inverse Document Frequency (IDF), or TFIDF (both combined) approaches, or the indexer 118 may eliminate redundant features, or the indexer 118 may eliminate features with less information content.
- TF Term Frequency
- IDF Inverse Document Frequency
- TFIDF both combined
- the indexer 118 may then cluster the data and encode it into static modules so that the crawled information can be distributed more efficiently to users. Some of the scaling and elimination functions may also be applied after clustering. These operations may be performed independently to each domain, such as news, blogs, websites, books, etc.
- the depicted online system 110 also includes an ad server 120 and an ad database 122.
- the ad server 120 and ad database 122 are representative of one or more ad servers and databases, which might be distributed anywhere on the internet 112.
- the ad server 120 pulls ads from the ad database 122 and sends them to be displayed on a web browser at a client 124 (either 124a or 124b).
- the ad server 120 and ad database 122 also may be used to facilitate improved advertising methods at the client 124 according to a user's advertisement profile, described in more detail below.
- the client 124 may run software which accesses the ad server 120 in real time. In this way, the client 124 may pull and display the ads to the user.
- the software may redirect the client browser through the ad server 120 so that, under predetermined business arrangements, the ad server 120 and another party may get credits or payment for the user's advertisement selection.
- the online system 100 also includes a web application server 129.
- the online system 100 also includes an indexed data server 126.
- the indexed data server 126 is coupled to an indexed database 128.
- the indexed database 128 stores object vectors and summary vectors associated with the content objects stored on the crawl database 116.
- the object vectors and summary vectors are described in more detail below, with reference to the client 124, it should be noted that the object vectors are vector representations of the content objects (e.g., websites, news items, etc.) on the crawl database 116, and the summary vectors are vector representations of a group of object vectors.
- the indexed data server 126 may access a hierarchy of object vectors and summary vectors (including higher level summary vectors to describe lower level summary vectors).
- the indexed data server 126 serves the static data (e.g., vectors) in the indexed database 128.
- the data on the indexed database 128 may be distributed around the internet 112 as static files and served by multiple indexed data servers 126.
- the online system 100 also includes one or more clients 124a and 124b
- Each client 124 represents a user computer or other user access device that is capable of running a web browser. Although many different web browsers may be used, some typical web browsers include INTERNET EXPLORER® by Microsoft and FIREFOX® by Mozilla. Exemplary clients 124 include personal computers, laptop computers, personal digital assistants (PDAs), cellular telephones, and other internet access devices.
- PDAs personal digital assistants
- Figure 2 illustrates one embodiment of a client 124 of the online system 100 of Figure 1.
- the client 124 runs one or more client applications which facilitate accessing web content that is correlated to a user's interests. Additionally, the web content may be classified according to the user's disinterests, e.g., topics or content in which the user does not explicitly or implicitly have an interest.
- the client application(s) may facilitate additional functionality.
- the description below describes separate applications for various functions, the same or similar functionality may be embodied in a single application which runs on the client 124.
- the client 124 includes a feature extraction application 132.
- the indexer 118 also may perform feature extraction.
- the feature extraction application 132 implements a method for modeling a content object by a vector of numbers.
- the method may include implementing one or more feature extraction algorithms 133. Once a content object is represented by a vector of numbers, classification algorithms can be applied to these vectors, as described below.
- This feature extraction is based on extracting a core set of features from a content object to adequately model that content object.
- the features of a piece of text can be a set of unique words in that text
- the vector that models the text can be the frequency of each of the words in that text.
- the vector may represent a superset of known words, with the frequency of each word stored in the vector at the locations corresponding to the words used in the text.
- the vector could include, for example, a million numbers corresponding to a million different known words, only the numbers of the vector which correspond to the words used in the text would be non-zero numbers; all other numbers would be zero.
- one simplified vector may look like [0 0 0 0 0 0 0.1 0 0 0 0 0 0.8 0 0 0 0 0.3 0 0 0 0], wherein the non-zero elements are associated with some content or feature of a content object.
- the vectors are relatively larger and may have millions of entries (many of which might be zero). [0031] Together with the identification of the relative content, an extract can be identified that can best describe this item to the user.
- the identification of the extract can be made based on multiple features including, but not limited to, the length of the extract, proximity to the title, etc. This extract can be used as an extended description of the item in the user interface.
- a photograph that represents the item can also be identified.
- the identification of the photograph that best describes the item can be made based on multiple features including, but not limited to, the size of the photograph, proximity to the title, location within the relative content region, and other features that determine if the photograph is or is not related to an advertisement.
- This photograph can also be used in the user interface to describe the item to the user.
- the feature abstraction application 132 focuses on modeling content objects with meta information and possibly hyperlinks within the content object.
- the feature abstraction application 132 uses the meta functions (e.g., titles, subtitles, tables, figure captions, etc.) within the content object and models each of these meta function items as separate features. Using meta functions in this manner may significantly enhance the information content of the model.
- the feature abstraction application 132 may use any hyperlinks within a content object in order to increase the information content of the model.
- the feature abstraction application 132 follows the hyperlinks and determines if the content indicated by each of the hyperlinks is fundamentally relevant to the object. If it is relevant, then the model may incorporate the content of the hyperlink into the original content object. This method may be applied to all the hyperlinks in an object. Moreover, this method may be applied to hyperlinks contained in the content associated with the original hyperlinks, and so forth.
- the feature extraction application 132 also may identify one or more parts of a content object that has the relevant content. There are multiple ways to do this. In one method, a graphical template approach is used to identify an area of a content object which has relevant information. This template subsequently may be automatically applied to other similar content objects (e.g., different pages of a website). [0035] In another method, an info gain metric may be associated with each feature of the content object. In one embodiment, identified features with negligible info gain for the content object may be eliminated or downgraded. For example, common words such as "a" and "the” may be disregarded when they appear frequently in a given content object. In some embodiments, a similar info gain metric may be applied to several content objects of similar type. This has the effect of enhancing the features with the most information for a class of objects.
- common features within similar content objects may be identified and eliminated or downgrade. This also may have the effect of enhancing those features with the most information for a class of objects.
- objects in a given class are compared to one another and a "structural difference" operation is performed on its content. In other words, structural commonalities in different content objects (e.g., common menu button locations in websites) may be identified and eliminated or downgraded. This also may have the effect of enhancing those features with the most information for a class of objects, assuming the most unique features have the most useful information.
- content of an object may be reconfigured to identify relevant information.
- formatting commands in a file may be represented by an appropriate number of spaces for each type of formatting command. Then this string of characters (i.e., letters, numbers, and spaces) may be processed to identify contiguous blocks of content.
- a contiguous block of content is delineated by a long enough string of spaces, based on the contiguous number of characters preceding it. The usefulness of contiguous blocks of content may be identified by their length after such reformatting.
- the client 124 includes a classification application 134.
- the classification application 134 implements a method for classifying large amounts of objects. The method may include implementing one or more classification algorithms 135.
- the method is based on clustering the objects and summarizing each of these clusters by a "summary vector.”
- Summary vectors may be similar to object vectors, but summary vectors may be denser than object vectors because they are weighted sums of object vectors.
- the summary vectors may be used in the classification to determine if a cluster has relevant vectors that should be used in the next level of classification.
- a hierarchical approach can be constructed that scales for multiple levels by implementing clusters of summary vectors.
- the negative and positive labeled elements of the training set can be modified at each level of the hierarchy to achieve better results. For example, certain negatively labeled elements may be dropped at higher levels of the hierarchy based on the granularity of that level. In another embodiment, certain positively and negatively labeled elements of the training set can be "combined" into one or more negatively and/or positively labeled training set elements based on a given level of the hierarchy.
- boundary nodes may be used to represent a collection of vectors by characterizing the boundary between a cluster and other clusters. In this approach, a cluster may be represented by a single summary vector.
- a series of vectors may be constructed so that each vector characterizes the relationship with one other cluster.
- all vectors that have a neighborhood relationship with another cluster can be used to characterize the boundary of a cluster.
- these neighborhood vectors can themselves be summarized into a preset number of vectors.
- summary vectors for relevant clusters are "blown up," or replaced by its elements.
- the summary vector is replaced by the elements associated with the summary vector.
- a classification algorithm is used to determine which clusters should be blown up.
- the result of the adaptive graph classification is the set of elements of the graph that are already at the leaves of the hierarchy.
- One example of the adaptive graph method is shown in more detail in Figures 11-19.
- a separate classification step is explicitly constructed for each level of the hierarchy.
- the first classification is performed using the training set and the summary vectors at the first level of the hierarchy. Then, based on the result of the classification, a subset of the first level clusters is blown up, and, together with the training set, a second level of classification is performed. This iteration is repeated until the leaf level of the hierarchy is reached.
- One example of the rapid fire method is shown in more detail in Figures 20-23.
- the negative and positive labeled elements of the training set can be modified at each level of the hierarchy to achieve better results. For example, certain negatively labeled elements may be dropped at higher levels of the hierarchy based on the granularity of that level. In another embodiment, certain positively and negatively labeled elements of the training set can be "combined" into one or more negatively and/or positively labeled training set elements based on a given level of the hierarchy.
- Clustering also may be used to facilitate classification of the vectors, as described above.
- the client 124 includes a clustering application 136.
- the clustering application 136 implements a method for clustering the vectors into high-level groups.
- a set of 1,000,000 vectors may be subdivided into ten clusters.
- the method may include implementing one or more clustering algorithms 137.
- cluster vectors two approaches include using k-means and graph-based clustering.
- the approach that is implemented to perform clustering may depend on the number of elements in the set to be clustered.
- k-means clustering may be used for a set having 1 ,000,000 vectors.
- graph-based clustering may be used for a set having, for example, 10,000 vectors or less.
- graph-based clustering may provide better results, but is typically limited in the number of objects it can deal with.
- other approaches may be used.
- the clustering application 136 may cluster 1,000,000 objects into ten groups.
- the groups may or may not be equal in size.
- each group is represented with one or more summary vectors, as described above.
- Each of these 10 groups is then divided into, for example, another 10 groups. This division process can continue in a similar fashion either for a pre-defined set of levels or until there are a sufficiently small number of elements at the leaf groups.
- hierarchical classification also may be implemented for RSS feeds or other dynamic content. News feeds and weblogs (blogs) are examples of dynamic content.
- RSS feeds and many website applications is that the websites are typically static, whereas the content of the RSS changes dynamically.
- the classification application 134 may implement a different method to classify dynamic content.
- each RSS feed in the domain may be represented statically by taking a snapshot of its contents at some point in time. Typical feature extraction may be performed on the static snapshot of the dynamic content.
- the hierarchical classification approach may be used to select a group of RSS feeds that the user might be interested in for a particular tag. Selected RSS feeds may be added to any RSS feeds that the user may have configured manually for the same tag.
- the current items in each RSS feed may be sampled and classified with positive and negative examples which the user has provided in order to pick the set of RSS items that the user is likely to be interested in.
- the item level classification may be repeated.
- the item level classification may be repeated on a regular basis.
- the hierarchical RSS feed classification may be repeated when the user provides new input in the form of positive or negative tagging.
- the classification application 134 also may implement a method for optimizing the parameters used to classify the vectors. First, a random training set is selected based on clusters. For a given training set, the optimum parameters are found for a level in the hierarchy based on achieving the maximum percentage of truly positive elements surviving in the hierarchy nodes selected as a result of the classification. This optimization step may be repeated for each level of the hierarchy.
- the optimum parameters are found based on achieving one or more of the following: a maximum percentage of truly positive elements within those that are classified as positive; a weighted sum of scores for the positive elements at the leaf level; or the number of truly positive elements within a pre-determined number of highest ranked elements. Alternatively, other criteria may be used.
- the optimum parameters are then applied to a series of tests, each of which uses a different random training set. In one embodiment, the test is measured by statistics on the accuracy of the top diverse elements that are selected.
- the client 124 also includes a content searching application 138.
- the content searching application 138 implements a method for searching for content that is similar to a user's interest profile 140.
- the content searching application 138 uses the classified and clustered vectors to determine which objects should be associated with the user's interests and which objects should not be associated with the user's interests. Additionally, the content searching application 138 may determine which objects might be associated with the user' s disinterests.
- the method may include implementing one or more content searching algorithms 139.
- the method may implement a Bayesian algorithm, a support vector machine (SVM) algorithm, or a spectral graph theory (SGT) algorithm. Each of these algorithms is described below, although the general details of these algorithms are known within the context of conventional applications. Alternatively, the method may implement another algorithm.
- the Bayesian algorithm is a conventional statistical approach. Basically, it considers the weighted average for a positive example (e.g., known interests) and a negative example set (e.g., known disinterests). Using this information, the content searching application 138 may determine whether a candidate object should be labeled as an interest or a disinterest (or simply not labeled as an interest) based on the candidate object's relative distance from the two weighted averages.
- the SVM algorithm is a conventional algorithm to determine a boundary between the positive set and the negative set. In order to identify the boundary, the SVM algorithm takes into account known positive and negative examples, and finds the "maximally separating" boundary between the two sets.
- the SGT algorithm is a conventional algorithm that is somewhat similar to the SVM algorithm.
- the SGT algorithm operates on a graph that represents all the objects in the domain, including positive and negative objects, as well as candidate objects. It reduces the boundary identification to a minimum cut problem on the graph.
- the client 124 may store at least a partial copy of the data from the indexed database 138 on a local cache 142.
- the cache 142 stores data that is likely to be related to the user's interests defined in the interest profile 140.
- the client 124 may primarily search the local cache 142, saving time and power by not having to communicate with the indexed data server 126 or other system components for every content search.
- the client cache 142 may be updated, for example, periodically or in response to an update to the user's interest profile 140.
- the interest profile 140 is a vector of numbers to indicate which objects a user has indicated are interests and which objects the user has indicated are disinterests.
- the user may "tag" or mark a content object (or the associated vector) as an interest or disinterest by marking the content via a user interface such as an internet browser.
- the user may tag one or more of the returned content objects as an interest by selecting an icon next to a representation (e.g., a hyperlink or a summary description) of the content object.
- the icons for the user to select interests and disinterests may be "thumbs up” and “thumbs down” icons, respectively, although other types of icons, graphics, text, or colors may be used instead or in addition to these exemplary icons.
- the client 124 also may store an advertisement profile 144.
- the advertisement profile 144 similar to the interest profile 142, may be a vector of numbers to indicate which advertisements the user does or does not like. In some embodiments, the advertisement profile 144 may depend at least in part on the interest profile 142. In some embodiments, the advertisement profile 144 or the interest profile 142, or both, may be used to select advertisements to be presented to the user. For example, advertisement keywords may be computed in real time, based on the "dynamic" interest profile 140, and sent to the ad server 120, which returns relevant ads (or a subset of ads) to be displayed to the user.
- Figure 3 illustrates one embodiment of a web browser 150 with a user interface for the online system 100 of Figure 1.
- a particular web browser 150 is shown in the drawing, other embodiments may be implemented in conjunction with other types of web browsers.
- the illustrated user interface implemented in the web browser 150 includes a toolbar 152, a sidebar 154, and a main window 156.
- the main window 156 displays content from the internet 112. This content may be retrieved from the internet 112 according to the interests and disinterests of the user (which may be displayed in the sidebar 154, for example). As described above, the interests and disinterests of the user may be defined in the interest profile 150 stored on the client 124.
- the main window 156 may display internet links 158 to several categories of internet content, including "News and Blogs,” “Interests,” “Books,” and “Group Posts” (shown in Figure 3). Advertisements 160 also may be displayed (for example, along the right edge of the main window 156). Additionally, the main window 156 may include excerpts from the linked websites, dates, times, pictures, and other similar content information.
- the main window 156 also includes icons 162 (or other selection mechanisms) to allow a user to indicate whether or not they are interested in the displayed link or content.
- This selection may be stored in the user's interest profile 140.
- the user may designate a link to a national tennis tournament as an interest, but designate a link to a table tennis website as a disinterest.
- the interest profile 140 is hierarchical in that it allows the user to designate content that the user considers interesting or not interesting as it relates to a particular theme. Using the previous example, the user may select a table tennis link as a disinterest as it relates to the theme, or interest, of tennis.
- the user also may designate the same table tennis link as an interest as it relates to another theme, or interest, such as ping pong.
- the same content may be designated as selectively belonging to one interest, or theme, and not to others for the same user.
- the sidebar displays a list of the user's designated interests. For each interest, the user may select a tab 164, and the contents of the sidebar 154 may be adjusted to show a summary of links 158 or other information related to the selected interest. In another embodiment, the sidebar 154 also may show designated disinterests. Additionally, the sidebar 154 may display an icon or use another indicator to indicate which interests are shared with other users. [0060] In one embodiment, the toolbar 152 may include several buttons 162 or other user interface devices to allow the to navigate the user interface, designate content as an interest or disinterest, share interests with other users, search for content related to a selected interest, and so forth.
- Figure 4 illustrates another embodiment of the web browser 150 and user interface of Figure 2.
- the user interface allows a user to see and navigate properties of each of the selected interests. For example, a user may see and modify which content objects (e.g., websites, links, RSS feeds, etc.) the user has designated as belonging to that interest theme. Also, the user may see and modify which content objects the user has specifically excluded as disinterests (i.e., negatively tagged) from the selected interest theme.
- the depicted user interface also may allow the user to modify sharing properties for the interest, view and modify group posts related to the interest, and so forth.
- Figure 5 illustrates another embodiment of the web browser 150 and user interface of Figure 2.
- Figure 5 shows an exemplary list of negatively tagged content objects.
- the user has selected these items as being disinterests as they relate to the selected interest theme.
- Figure 6 illustrates another embodiment of the web browser 150 and user interface of Figure 2.
- Figure 6 shows an exemplary list of potential users with whom the user may share an interest.
- the user may invite other users to share a selected interest, thereby allowing the invited users to so and potentially modify the user's selected interest profile.
- Figure 7 illustrates another embodiment of the web browser 150 and user interface of Figure 2.
- the user interface may allow the user to see the groups combined positively and negatively tagged content objects of the group. In this way, the user may see which other users have tagged a particular content object.
- the user interface may allow the user to perform other functions in regard to creating and managing the user's interest profile, as well as finding new content objects that might relate to the user's selected interests.
- the user may provide high level preferences as they relate to their interests which can then be used in conjunction with and to drive the classification results.
- the user may group his interests to higher level interest groups, which the application could use to organize content.
- Figure 8 illustrates a schematic flow chart diagram of one embodiment of a content classification algorithm 170.
- the content classification algorithm 170 may be implemented by the classification application 134, as described above.
- the user provides 172 positive or negative examples such as the designations from a user interest profile 140.
- the classification application 134 then runs 174 a classification algorithm for each domain, and then may display 176 the results.
- Figure 9 illustrates a schematic flow chart diagram of another embodiment of a content classification algorithm 180 for classifying a static domain.
- the classification application 134 gets 182 the first level of a static tree and adds a training set.
- An example of a training set is a set of positive and negative examples provided by the user.
- the classification application 134 then performs 184 diverse classification and selects a set of best elements so that, in one embodiment, the total number of "children" elements is less than some number, N.
- the classification application 134 then expands 186 the selected list and repeats the previous operations until the leaf nodes are reached. Then, the classification application 134 performs 188 diverse classification and selects the best diverse set which may be shown to the user.
- FIG 10 illustrates a schematic flow chart diagram of another embodiment of a content classification algorithm 190 for classifying a dynamic domain using really simple syndication (RSS) feeds.
- the classification application 134 gets 192 the first level of an RSS tree and adds the training set.
- the classification application 134 then performs 194 diverse classification and selects a set of best elements so that, in one embodiment, the total number of "children" elements is less than some number, N.
- the classification application 134 then expands 196 the selected list and repeats the previous operations until the leaf RSS nodes are reached.
- the classification application 134 then performs 198 diverse classification and selects the best diverse set of RSS feeds to be sampled.
- Figures 11-23 illustrate embodiments of hierarchical classification algorithms that may be implemented to classify content objects in the online system of Figure 1.
- Figures 13-19 illustrate one embodiment of the adaptive graph method 210
- Figures 20-23 illustrate one embodiment of the rapid fire method 220, both of which are described above.
- the indexed data server 126 has a representation of the results of hierarchical classification. At each node, the indexed data server 126 has a summary vector (SV).
- the indexed data server 126 maintains the closest URL for that summary vector. In some embodiments, this URL is not repeated further below in the tree.
- the indexed data server 126 also maintains the closest RSS for each node. At the leaf level, the indexed data server 126 maintains URLs and RSSs. In some embodiments, this representation in the indexed data server 126 changes periodically (e.g., monthly).
- the client 124 instantiates several classifiers (one per user/tag pair). For each classifier, the client 124 develops a relevant and resource-constrained mirror of the server- side tree. In some embodiments, a classifier is relevant if it contains a URL that the user is interested in. Similarly, a classifier may be relevant if it contains summary vectors that allow a good classification for classification of new RSS items and/or ads. In some embodiments, a classifier is resource-constrained if it is not possible to replicate the entire server side tree onto the client 124. [0072] In one embodiment, the adaptive graph method 210 begins by blowing up the root. For example, URLs and/or RSSs may be presented to the user for tagging.
- nodes may be blown up. For example, positively scored nodes may be blown up. Also, nodes with the highest score may be blown up. This process continues until constrains are violated or until the leaves of the URLs and RSSs are reached. In the meantime, there may be families of nodes that may be removed from the tree (see Figures 17 and 18) to improve the results. If maximum constraints are reached for the number of nodes in the tree, then families of nodes that pull down the results (e.g., negative averages) may be removed. In some embodiments, this is performed without affecting the current status of the other nodes. When no other positively scored nodes can be blown up, then the nodes with the most positive scores can be blown up. Other embodiments of the adaptive graph method 210 may include additional features.
- the root is blown up, similar to the description above. URLs and/or RSSs are also presented o the user for tagging. As the best nodes are blown up, some of the levels of the hierarchy may be discarded (see Figures 22 and 23). Some of the operations of the rapid fire method 220 may be substantially similar to the operations of the adaptive graph method 210. Other embodiments of the rapid fire method 220 may include additional features. [0074] ADDITIONAL EMBODIMENTS. It should be noted that many of the embodiments described herein may incorporate additional functionality such as the functional described below. Figure 24 illustrates another embodiment of the client 124 of Figure 2, including additional applications to facilitate implementation of some or all of the functions described below.
- the depicted client 124 also includes an accordion interface application 232, a user relevance application 234, a negative examples application 236, a smart scrolling application 238, an advertisement selection application 240, and a peer to peer collaboration application 242.
- Other embodiments of the client 124 may include fewer or more applications.
- DIVERSE SUGGESTIONS AND LEARNING One embodiment implements a method to increase classification accuracy as well as suggestion diversity using a very small set of learning examples.
- the learning examples are shown to the user to allow the user to identify which ones are interests and which was are disinterests.
- the domain is classified and clustered.
- the cluster information may be used in addition to score information, which results from clustering. In this way, the method may facilitate showing a diverse set of possible selections to the user, while limiting the number of similar possible selections.
- a single content object is shown from a group of similar content objects (e.g., news items) from different sources (e.g., news agencies), no matter how strongly relevant they might be to the user's selected interest. Instead, different possible selections of content objects (e.g., news items) that are relevant to the user's interests may be shown. Additionally, since the user's feedback is based on the suggestions, the algorithm receives diverse feedback. This may beneficially speed up the convergence of the classification algorithm.
- ADVERTISEMENT SELECTION One embodiment implements a method for displaying relevant advertisements while the user is surfing the web.
- the possible advertisement content is classified based on user interests.
- all the potential advertisements are downloaded and feature extraction is performed.
- the domains of advertisements may be classified together with other domains, and the top relevant advertisements are shown to the user.
- a domain of keywords may be used. For each keyword, a web search is performed, and feature extraction is performed on the results that are returned.
- the domain of keywords may be classified together with other domains.
- the top keywords are used and sent to the advertisement feed (e.g., advertisement server) to receive advertisement content relevant to those keywords.
- advertisement feed e.g., advertisement server
- an advertisement domain search is performed for each keyword, and feature extraction is performed on the results that are returned.
- the domain of keywords may be classified together with other domains.
- the top keywords are sent to the advertisement feed to receive advertisements relevant to those keywords.
- the advertising selection methods may allow the user to provide positive or negative feedback on the advertisement. This feedback is then used to select targeted advertisements that match the user's advertisement profile.
- similar functionality may be applied to conventional online auction content that is content based rather than keyword based.
- PEER TO PEER LEARNING AND CLASSIFICATION One embodiment implements a method of collaborating on web surfing and communicating results between different users.
- a user may share an interest among a set of peers chosen by the user.
- the positive and negative feedback supplied by any peer within this group is then applied to the classifiers of all other peers within the group as positive and negative examples, respectively.
- the classifications for each peer within the group may be performed independently. In this way, the users potentially may be shown different results resulting from the different classifications. Their feedback is collected and the process is repeated with the new feedback.
- the member can attach a note to the item which will then be transmitted to all the other group members.
- an application program interface to exchange data may be used.
- many internet chat programs have an application to application API which allows users to use its chat capabilities for exchanging data.
- SKYPE® may be used as the internet chat program.
- Skype is a chat and voice program available from Skype Technologies of London, United Kingdom.
- Skype has an application to application API which allows its chat capabilities to be used for exchanging data to implement the sharing functionality.
- Skype's user naming mechanism may be used to uniquely identify users across the internet. In this way, the user naming and the chat mechanism may allow one user (local user) to invite another user (target user) to share a tag, or interest. For example, in one embodiment, a target member may be polled to determine if the target member has installed application the peer to peer collaboration application. If not, the local user may invite the target user to install the application.
- the target user may be notified about the local user's invitation to share a tag, or interest.
- the target user may be notified about this tagging action.
- the local user may attach a note to this notification.
- this notification is performed reliably (i.e., even if a member user is not present, that user is eventually notified when they become available).
- One embodiment implements a method for content discovery which uses training examples for one type of content (or domain) to discover a completely different type of content (or domain).
- training examples are provided for one type of content such as internet websites.
- the classification algorithm described above, may be capable of finding relevant content from, for example, news articles using examples provided from the internet sites domain. This is possible at least in part because each domain has its own feature extraction. By using a unique feature extraction for each domain, fundamental pieces of information may be extracted from a content object and used to model an object from each domain. Therefore, feedback received for objects in a particular domain can be used to discover content in a completely different domain.
- This method may have business applications. For example, this method may be implemented in business models supported by advertising.
- feedback provided by the user in the news domain and the websites domain may be used to extract keywords relevant to the user. These keywords are then used to extract keyword based advertisement from ad servers.
- the extracted advertisements are relevant to the user's interest. Under conventional business models, when a user clicks on an advertisement, revenue is generated.
- the feedback may be used to classify a books domain, allowing potential book selections that are relevant to the user's interest to be shown to the user. Under conventional business models, when the user makes a purchase, revenue is generated.
- the feedback may be used to classify an auctions domain, allowing auction items that are relevant to the user's interest to be shown to the user. When the user makes a purchase from the auction site, revenue is generated.
- TEMPLATES One embodiment implements a template.
- a template is an interest tagged with a pre-defined set of positive and negative examples. Templates may be used in various ways. For example, templates may be prepared for typical interests such as international politics, sailing, football, and so forth.
- a library of templates may be created and distributed to a user as a service.
- partner organizations may request that specific templates be prepared for them for their use or for their clients.
- the preparation of templates may be provided as a service for organizations.
- users may be allowed to create templates or template groups and distribute these to other users.
- ACCORDION INTERFACE Another embodiment implements an "accordion" user interface mechanism.
- the accordion user interface mechanism may be used both in the sidebar 154 and the main page 156.
- Each accordion user interface mechanism may include the following properties: the contents can be anything; restrictions can be imposed on their behavior (for example, when one item is opened, all other items can be forced to close); they can be reordered by conventional drag and drop operations; and they remember their state so that when a page is revisited, the open/close state for each individual pane is preserved.
- USER RELEVANCE DURING BROWSING the contents can be anything; restrictions can be imposed on their behavior (for example, when one item is opened, all other items can be forced to close); they can be reordered by conventional drag and drop operations; and they remember their state so that when a page is revisited, the open/close state for each individual pane is preserved.
- One embodiment implements a method for inferring a user preference of a particular URL by observing the user's browsing habits.
- user sessions are identified in which the user is looking for a particular piece of information or a particular category of information. Such sessions may be delineated by the gaps in user activity with the browser.
- the user's preference may be inferred by the length of time a user spends at a particular page or how the user navigates away from the page. For example, a preference to designate a page as a user interest may increase with as the user's "dwell" time and interaction with a page increases.
- a preference to designate a page as a user interest may decrease if the user navigates away from the page by using a typical "back" page operation in the browser.
- CREATING NEGATIVE EXAMPLES WHEN THERE ARE NONE One embodiment implements a method for creating negative examples when the user has supplied no negative examples. For example, if the user has only positive examples, then the method may create negative examples to facilitate improved classification. In this method, a number of examples from clusters that are farthest away from the positive examples may be selected as the negative examples.
- a number of examples from the positive examples related to other interests of the user may be selected so that examples from overlapping interests can be determined and avoided by using a distance metric between examples for the interest in question and the other interests.
- negative examples which the user may have specified for any other interest may be used, with the exception of those examples which are similar to the positive examples for the selected interest, as determined by a distance metric.
- SMART SCROLLING OF LISTS One embodiment implements a method for scrolling lists in an infinite "tape loop" fashion. This method involves an infinite slider, in addition to tape player style play, stop, and fast forward buttons. The user can access any location in this potentially infinite list by moving the slider, and the list changes "on demand" based on the user requests. In addition, the user can instigate scrolling of this list by hitting the play button in the appropriate direction. Hitting the stop button stops the scrolling and hitting the fast forward button increases the scrolling speed. [0088] Embodiments of the present invention include various operations, which are described herein. These operations may be performed by hardware components, software, firmware, or a combination thereof.
- the term "coupled to” may mean coupled directly or indirectly through one or more intervening components. Any of the signals provided over various buses described herein may be time multiplexed with other signals and provided over one or more common buses. Additionally, the interconnection between circuit components or blocks may be shown as buses or as single signal lines. Each of the buses may alternatively be one or more single signal lines and each of the single signal lines may alternatively be buses. [0089] Certain embodiments may be implemented as a computer program product that may include instructions stored on a machine-readable medium. These instructions may be used to program a general-purpose or special-purpose processor to perform the described operations.
- a machine-readable medium includes any mechanism for storing or transmitting information in a form (e.g., software, processing application) readable by a machine (e.g., a computer).
- the machine-readable medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto- optical storage medium; read-only memory (ROM); random-access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.); or another type of medium suitable for storing electronic instructions.
- magnetic storage medium e.g., floppy diskette
- optical storage medium e.g., CD-ROM
- magneto- optical storage medium e.g., magneto- optical storage medium
- ROM read-only memory
- RAM random-access memory
- some embodiments may be practiced in distributed computing environments where the machine-readable medium is stored on and/or executed by more than one computer system.
- the information transferred between computer systems may either be pulled or pushed across the communication medium connecting the computer systems.
- the digital processing device(s) described herein may include one or more general-purpose processing devices such as a microprocessor or central processing unit, a controller, or the like.
- the digital processing device may include one or more special-purpose processing devices such as a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or the like.
- DSP digital signal processor
- ASIC application specific integrated circuit
- FPGA field programmable gate array
- the digital processing device may be a network processor having multiple processors including a core unit and multiple microengines.
- the digital processing device may include any combination of general-purpose processing device(s) and special-purpose processing device(s).
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
L'invention concerne un système et un procédé d'indexation de contenus. Le système comporte un moteur de balayage, une base de données de balayage, un indexeur, une application de classification et un serveur de données indexées. Le moteur de balayage est configuré pour balayer le réseau Internet et rechercher des objets de contenus. La base de données de balayage est couplée au moteur de balayage et configurée pour placer les objets de contenus dans une mémoire cache. L'indexeur est couplé à la base de données de balayage et configuré pour effectuer une extraction de caractéristiques concernant les objets de contenus, puis regrouper les objets de contenus en générant un vecteur d'objet. L'application de classification est couplée à l'indexeur et configurée pour regrouper les vecteurs d'objets, puis générer un vecteur résumé. Le serveur de données indexées est couplé à l'indexeur et configuré pour communiquer les objets de contenus à un client.
Applications Claiming Priority (16)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US84678806P | 2006-09-22 | 2006-09-22 | |
| US60/846,788 | 2006-09-22 | ||
| US11/859,514 US20080077659A1 (en) | 2006-09-22 | 2007-09-21 | Content Discovery For Peer-To-Peer Collaboration |
| US11/859,467 | 2007-09-21 | ||
| US11/859,446 US20080077576A1 (en) | 2006-09-22 | 2007-09-21 | Peer-To-Peer Collaboration |
| US11/859,457 | 2007-09-21 | ||
| US11/859,504 | 2007-09-21 | ||
| US11/859,457 US20080077578A1 (en) | 2006-09-22 | 2007-09-21 | Feature Extraction For Peer-To-Peer Collaboration |
| US11/859,478 | 2007-09-21 | ||
| US11/859,478 US20080077580A1 (en) | 2006-09-22 | 2007-09-21 | Content Searching For Peer-To-Peer Collaboration |
| US11/859,514 | 2007-09-21 | ||
| US11/859,496 | 2007-09-21 | ||
| US11/859,496 US20080077494A1 (en) | 2006-09-22 | 2007-09-21 | Advertisement Selection For Peer-To-Peer Collaboration |
| US11/859,467 US20080077579A1 (en) | 2006-09-22 | 2007-09-21 | Classification For Peer-To-Peer Collaboration |
| US11/859,504 US20080077669A1 (en) | 2006-09-22 | 2007-09-21 | Peer-To-Peer Learning For Peer-To-Peer Collaboration |
| US11/859,446 | 2007-09-21 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2008036960A2 true WO2008036960A2 (fr) | 2008-03-27 |
| WO2008036960A3 WO2008036960A3 (fr) | 2008-11-13 |
Family
ID=39201352
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2007/079257 Ceased WO2008036960A2 (fr) | 2006-09-22 | 2007-09-22 | Identification de contenus et collaboration de type pair-à-pair |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2008036960A2 (fr) |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6906719B2 (en) * | 2002-10-12 | 2005-06-14 | International Business Machines Corporation | System and method for content-based querying using video compression format |
-
2007
- 2007-09-22 WO PCT/US2007/079257 patent/WO2008036960A2/fr not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| WO2008036960A3 (fr) | 2008-11-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20080077494A1 (en) | Advertisement Selection For Peer-To-Peer Collaboration | |
| CN100485677C (zh) | 搜索结果中放置内容排序的个性化 | |
| US8589373B2 (en) | System and method for improved searching on the internet or similar networks and especially improved MetaNews and/or improved automatically generated newspapers | |
| JP5941075B2 (ja) | 信頼ネットワークを含むユーザ判断を一体化したサーチシステム、方法及びコンピュータ読取可能媒体 | |
| US7647306B2 (en) | Using community annotations as anchortext | |
| US20090006388A1 (en) | Search result ranking | |
| US20060155764A1 (en) | Personal online information management system | |
| US20050222989A1 (en) | Results based personalization of advertisements in a search engine | |
| US20080077669A1 (en) | Peer-To-Peer Learning For Peer-To-Peer Collaboration | |
| US20080077580A1 (en) | Content Searching For Peer-To-Peer Collaboration | |
| Wang et al. | Exploring online social activities for adaptive search personalization | |
| US20030149580A1 (en) | Customized interaction with computer network resources | |
| JP2008176511A (ja) | コンピュータネットワークにおける情報処理方法および情報処理装置 | |
| CN106462588B (zh) | 来自所提取的内容的内容创建 | |
| US20080077576A1 (en) | Peer-To-Peer Collaboration | |
| US20080077578A1 (en) | Feature Extraction For Peer-To-Peer Collaboration | |
| WO2008130482A1 (fr) | Systèmes et procédés de personnalisation d'un journal | |
| Bogers | Recommender systems for social bookmarking | |
| US20080077659A1 (en) | Content Discovery For Peer-To-Peer Collaboration | |
| US20080077579A1 (en) | Classification For Peer-To-Peer Collaboration | |
| WO2008036960A2 (fr) | Identification de contenus et collaboration de type pair-à-pair | |
| Hsu et al. | Efficient and effective prediction of social tags to enhance web search | |
| AU2012202738B2 (en) | Results based personalization of advertisements in a search engine | |
| Athinarayanan et al. | Using pattern analysis and machine learning to categorise users of online directories based on their surfing habits | |
| Khanchana et al. | Web page prediction for web personalization: a review |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07843029 Country of ref document: EP Kind code of ref document: A2 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 07843029 Country of ref document: EP Kind code of ref document: A2 |