WO2017023359A1

WO2017023359A1 - Management of content storage and retrieval

Info

Publication number: WO2017023359A1
Application number: PCT/US2016/014248
Authority: WO
Inventors: Sitaram Asur; Freddy Chua; Srinivas Subbarao MULGUND
Original assignee: Hewlett-Packard Development Company, L.P.
Priority date: 2015-07-31
Filing date: 2016-01-21
Publication date: 2017-02-09

Abstract

According to an example, a plurality of content may be accessed, in which the plurality of content is of heterogeneous content types, and may be aggregated into a plurality of contextual topics. A query containing a keyword may be received and a contextual search may be performed on the plurality of contextual topics based upon the query to identify content relevant to the query. A query-based summary for each of the identified content may be created, in which the query-based summary is a merging of sentences ordered according to their perplexities with respect to the contextual topic to which the identified content containing the sentences are aggregated and sentences ordered according to counts of the query keyword contained in the sentences. In addition, the identified content and the query-based summary may be outputted.

Description

MANAGEMENT OF CONTENT STORAGE AND RETRIEVAL

BACKGROUND

[0001] Many businesses and corporations maintain repositories with large volumes of data that are generated by a large number of personnel. The data often includes heterogeneous types of documents, such as spreadsheets, emails, invoices, engineering drawings, or other document types. As the data is typically collected and stored over a number of years, the repositories often store vast amounts of diverse types of data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:

[0003] FIG. 1 is a simplified schematic diagram of a computing infrastructure, according to an example of the present disclosure;

[0004] FIG. 2 shows a flow diagram of a method for managing content storage and retrieval, according to an example of the present disclosure;

[0005] FIG. 3 shows a flow diagram of a method for aggregating a plurality of content into a plurality of contextual topics, according to an example of the present disclosure;

[0006] FIG. 4 shows a flow diagram of a method for integrating and extracting topics from content of heterogeneous sources, according to examples of the present disclosure;

[0007] FIGS. 5A and 5B show flow diagrams of a method for generating a topic model for integrating and extracting topics from content of heterogeneous sources, according to examples of the present disclosure;

[0008] FIG. 6 is a diagram of an example plate notation of parameters used to integrate and extract topics from content of heterogeneous sources, according to an example of the present disclosure;

[0009] FIG. 7 shows a flow diagram of a method for extracting summaries from the contextual topics, according to an example of the present disclosure;

[0010] FIG. 8 shows a flow diagram of a method for creating a query-based summary for each content identified during performance of a query, according to an example of the present disclosure; and [0011] FIG. 9 is schematic representation of a computing apparatus, which may be equivalent to the computing apparatus depicted in FIG. 1 , according to an example of the present disclosure.

DETAILED DESCRIPTION

[0012] For simplicity and illustrative purposes, the present disclosure is described by referring mainly to an example thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. As used herein, the terms "a" and "an" are intended to denote at least one of a particular element, the term "includes" means includes but not limited to, the term "including" means including but not limited to, and the term "based on" means based at least in part on.

[0013] Disclosed herein are apparatuses and methods for managing content storage and retrieval, in which the content may be of heterogeneous content types and may be organized for easy retrieval of relevant information. According to an example, the apparatuses and methods disclosed herein may aggregate a plurality of diverse content into a plurality of contextual topics and may generate summaries of each of the contextual topics based upon perplexities of the sentences in the content with respect to the contextual topics. The perplexity of a sentence with respect to a contextual topic may be defined as a measure of a likelihood that the sentence is or is not relevant to the contextual topic.

[0014] In addition, a contextual search for a query may be performed to identify content that is relevant to the query. A query-based summary for each of the identified content may be created, in which the query-based summary may be created through a merging of sentences ordered according to their perplexities with respect to the contextual topic to which the identified content containing the sentences are aggregated and sentences ordered according to counts of a query keyword contained in the sentences. Moreover, the identified content and the query-based summary may be outputted to a user. [0015] Through implementation of the apparatuses and methods disclosed herein, storage and retrieval of large amounts of diverse data may be managed such that the relevant data may be presented to users in a simple and efficient manner. Additionally, users may be presented with summaries of contextual topics corresponding to aggregations of the data as well as summaries of content identified to be relevant to queries. The summaries may respectively be short concise summaries of the contextual topics and short concise summaries of the identified content. By way of example, the summaries of the contextual topics may include sentences having the highest level of relevance to the contextual topics. Likewise, the summaries of the identified content may include sentences having the highest level of relevance to a query.

[0016] With reference first to FIG. 1, there is shown a block diagram of a computing infrastructure 100 within which is contained a computing apparatus that is to execute a method for managing content storage and retrieval according to an example of the present disclosure. It should be understood that the computing infrastructure 100 depicted in FIG. 1 may include additional components and that some of the components described herein may be removed and/or modified without departing from a scope of the computing infrastructure 100.

[0017] As shown, the computing infrastructure 100 may include a computing apparatus 110. The computing apparatus 100 may be a personal computer, a server computer, a smartphone, a tablet computer, or the like. The computing apparatus 110 is depicted as including a processor 112 and a machine-readable storage medium 120. The processor 112 may be any of a central processing unit (CPU), a semiconductor-based microprocessor, an application specific integrated circuit (ASIC), and/or other hardware device suitable for retrieval and execution of instructions stored in the machine-readable storage medium 120. The processor 112 may fetch, decode, and execute instructions, such as the instructions 122-134 stored on the machine-readable storage medium 120. [0018] That is, the processor 112 may execute the instructions 122-134 to access a plurality of content 122, aggregate the plurality of content into a plurality of contextual topics 124, extract respective summaries from the plurality of contextual topics 126, receive a query containing a keyword 128, perform a contextual search on the plurality of contextual topics based upon the query to identify content relevant to the query 130, create a query-based summary for each of the identified content 132, and output the identified content and the query-based summary 134. As an alternative or in addition to retrieving and executing instructions, the processor 112 may include one or more electronic circuits that include electronic components for performing the functionalities of the instructions 122-134. These processes are described in detail below.

[0019] The machine-readable storage medium 120 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, the machine-readable storage medium 120 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage device, an optical disc, and the like. In some implementations, the machine-readable storage medium 120 may be a non-transitory machine-readable storage medium, where the term "non-transitory" does not encompass transitory propagating signals.

[0020] As described in detail below, the processor 112 may execute the instructions 122-134 to manage storage and retrieval of heterogeneous content such that results of queries performed on the heterogeneous content may be obtained in a relatively quick manner while also providing useful information pertaining to the retrieved content. In one regard, execution of the instructions 122-134 may enable content to be organized for relatively easy retrieval of information relevant to a query, for instance, in a manner that results in an increased efficiency with which the query may be performed on a large set of heterogeneous data. Additionally, the results of the query may be provided with summaries that enable users to quickly and efficiently distinguish between the results of the query to identify the most relevant content

[0021] As also shown in FIG. 1 , the computing apparatus 110 may also include a data store 114 on which the processor 112 may store information, such as keywords or terms of a received query, results of a performed query, etc. The computing apparatus 110 may further include an input/output interface 116 through which the processor 112 may communicate with external devices over a network 150, which may be a local area network, a wide area network, the Internet, etc. In one example, the processor 112 may communicate with a plurality of sources 140a-140n, in which "n" represents an integer value greater than one. Each of the sources 140a-140n may be a data repository or other data storage device and may store content 142a-142n, source metadata 144a-144n, and content metadata 146a-146n. By way of example, the sources 140a-140n may include blogs, social networks, news sites, online retailers, document repositories, email clients, discussion forums, etc. In addition, the content 142a-142n may include documents, emails, databases, discussion groups, online postings, sharepoint, tutorials, news articles, blueprints, invoices, bug reports, source code, etc.

[0022] In any regard, and as discussed in greater detail below, the processor 112 may execute the instruction 124 to aggregate the content 142a-142n into contextual topics to enable the content 142a-142n to be searched and retrieved in an efficient manner. In addition, the processor 112 may execute the instruction 126 to extract summaries from the contextual topics. The processor 112 may further execute the instruction 130 to perform a contextual search to identify content 142a-142n from the contextual topics relevant to a query and the instruction 132 to create query-based summaries of the identified content.

[0023] As further shown in FIG. 1 , the processor 112 may communicate with an input/output device 160 through the input/output interface 116. The input/output interface 116 may include hardware and/or software to enable the processor 112 to communicate over the network 150. The input/output interface 116 may enable a wired or wireless connection to the network 150. The input/output interface 116 may further include a network interface card and/or may also include hardware and/or software to enable the processor 112 to communicate with various input and/or output devices 160, such as a keyboard, a mouse, a display, another computing device, etc., through which a user may input instructions into the computing apparatus 110 and may view outputs from the computing apparatus 110. Thus, although the input/output device 160 has been depicted as being a single component in communication with the input/output interface 116 through the network 150, it should be understood that the input/output device 160 may be multiple devices connected directly to the input/output interface 116, for instance, as peripheral devices to the computing apparatus 110.

[0024] With reference now to FIGS. 2-8, there are respectively shown flow diagrams of methods 200-800, according to various examples. It should be understood that the methods 200-800 depicted in FIGS. 2-8 may include additional operations and that some of the operations described therein may be removed and/or modified without departing from the scopes of the methods 200-800. The descriptions of the methods 200-800 are made with reference to the computing apparatus 110 depicted in FIG. 1 for purposes of illustration and thus, it should be understood that the methods 200-800 may be implemented in computing apparatuses having architectures different from those shown in the computing apparatus 110 in FIG. 1.

[0025] Generally speaking, the processor 112 of the computing apparatus 110 may implement or execute the instructions 122-132 stored on the machine-readable storage medium 120 to perform some or all of the methods 200-800.

[0026] With reference first to the method 200, which is a method for managing content storage and retrieval, at block 202, the processor 112 may execute instructions 122 to access a plurality of content 142a-142n, in which the content 142a-142n is of heterogeneous content types. For instance, the content 142a-142n may include emails, word processing documents, spreadsheet documents, invoices, and the like. As shown in FIG. 1, the processor 112 may access the content 142a-142n stored in multiple sources 140a-140n via a network 150.

[0027] At block 204, the processor 112 may execute instructions 124 to aggregate the plurality of content 142a-142n into a plurality of contextual topics. For instance, the processor 112 may aggregate the content 142a-142n from heterogeneous sources into latent contextual topics, while preserving the properties of the sources. By way of example, the processor 112 may preserve the types of the content 142a-142n during the aggregation. Various manners in which the processor 112 may aggregate the plurality of content 142a-142n into the plurality of contextual topics are described in greater detail herein below.

[0028] At block 206, the processor 112 may execute instructions 126 to extract respective summaries from the plurality of contextual topics. The summaries may be short concise summaries from the contextual topics that ensure maximal coverage of the contextual topics. Various manners in which the processor 112 may extract the respective summaries of the plurality of contextual topics are described in greater detail herein below. According to an example, the extraction of the summaries of the plurality of contextual topics may be optional. For instance, the extraction of the summaries may not be needed for a query to be performed on the contextual topics, but may be performed to provide users with summaries of the contextual topics to enable the users to refine query searches. By way of particular example, a user may determine that a particular contextual topic is either relevant or irrelevant to their query based upon the summaries and may thus adjust a query to either include or exclude the particular contextual topic.

[0029] At block 208, the processor 112 may execute instructions 128 to receive a query containing a keyword. The processor 112 may receive the query containing the keyword from a user via the input/output device(s) 160. For instance, a user may input, through the input/output device(s) 160. a query for content relevant to a particular keyword or keywords stored in the sources 140a-140n. According to an example, the processor 112 may parse the query to identify the keyword or keywords contained in the query. In addition, the processor 112 may scan the full text of the query to determine the context of the query, for instance, based upon the identified keyword or keywords.

[0030] At block 210, the processor 112 may implement instructions 130 to perform a contextual search on the plurality of contextual topics based upon the received query to identify content relevant to the query. A contextual search may be defined as a form of optimizing search results based upon the context provided by the user in the query. A contextual search may differ from a normal keyword search in that the contextual search may return results based on the relevance of the results to the query and not just results that contain a keyword contained in the query.

[0031] According to an example, the processor 112 may use the plurality of contextual topics to identify a topic distribution of the query. That is, the processor 112 may use the plurality of contextual topics to identify which of the plurality of contextual topics are relevant to the query. In addition, the processor 112 may search through the other content 142a-142n to find the content that are most relevant to the query. This type of search may be known as having the K Nearest Neighbor (KNN) search problem. In addition, the cost of this type of search may increase linearly as the size of the database grows, i.e., O(N). According to an example, in order to handle large volumes of data, the processor 112 may employ the use of metric of ball trees for the KNN search problem so that the cost of performing the KNN search reduces to 0(logN However, the metric ball tree requires the use of metric distance comparison between documents. Although the Euclidean distance is commonly used as a metric distance measure, the Euclidean distance may not be used because the vectors representing the documents and query do not belong to the Real space, instead they are statistical distributions. A statistical divergence such as Kullback Leibler may instead be used.

[0032] Divergence does not satisfy the symmetric and triangle inequality while the Jensen Shannon Divergence does not satisfy the triangle inequality. In one regard, therefore, the processor 112 may use the divergence given by Endres and Schindelin, which was proven to be a metric distance. In one regard, the tree search optimization may reduce search time significantly.

[0033] At block 212, the processor 112 may implement the instructions 132 to create a query-based summary for each of the content identified at block 208, in which the query-based summary is a merging of sentences ordered according to their perplexities with respect to the contextual topic to which the identified content containing the sentences are aggregated and sentences ordered according to counts of the query keyword contained in the sentences. The processor 112 may create the query-based summary to include representative sentences extracted from the identified content that are relevant to both the topic of the identified content and the query. Various manners in which the processor 112 may create the query-based summary for each of the identified content are described in greater detail herein below.

[0034] At block 214, the processor 112 may implement the instructions 134 to output the identified content and the query-based summaries. For instance, the processor 112 may output the identified content and the query-based summaries to the input/output device(s) 160, e.g., a display, such that a user may view the identified content and the query-based summaries as a result to a submitted query.

[0035] With reference now to FIG.3, there is shown a flow diagram of a method 300 for aggregating a plurality of content 142a-142n into a plurality of contextual topics, according to an example. The method 300 may therefore the equivalent to block 204 in the method 200 depicted in FIG. 2. The plurality of content 142a-142n may therefore be equivalent to the content accessed at block 202 discussed above with respect to FIG. 2. Generally speaking, the processor 112 may analyze the content 142a-142n by applying a modeling technique, such as the Discriminative Dirichlet Allocation (DDA) modeling technique. The DDA modeling technique models the dependence of latent topics on the observed words for a collection of documents. Accordingly, the DDA modeling technique maximizes the probability of topics assigned to individual words given certain features.

[0036] Examples of features for the model may use include the probability of a topic given a document, the probability of a topic given a word occurrence in a source, the probability of a topic given the word occurrence in all sources, etc. In the scenario in which the features are topic-word, topic-document and topic-source relationships, DDA attempts to maximize the conditional probability as follows:

The plate notation for Equation (1 ) above is discussed below with respect to FIG.6 . In this example, is the global topic distribution of word /, φ- is the local topic

distribution of word / in source is the topic distribution of content j in source s,

is the topic distribution of the word, /, associated with the word in source s,

is the number of contents in source s, S is S is the number of sources, is the

topic of the x^th word in the j^th document in source s, V is the number of words in the vocabulary, where is the number of words in content D, χ is a

K-dimensional vector where

the parameter for the Dirichlet prior on the per-word topic distributions, a is a K-dimensional vector where the

parameter for the Dirichlet prior on the per-document topic distributions on the per-topic word distributions, and K is the number of topics.

[0037] At block 302, the processor 112 may execute the instructions'! 24 to identify a plurality of observed words in the accessed plurality of content 142a-142n. For example, the processor 112 may analyze the content 142a-142n, e.g., documents, from each of the sources 140a-140n to identify the unique words occurring within the content 142a-142n and the quantity of each of the words occurring within each content In some cases, particular words may be ignored, such as common words (e.g., the, a, an, etc.). All of the words identified in the content 142a-142n of all of the sources'! 40a- 140n may be designated as the global vocabulary. The words identified in the content 142a of a single source 142a may be designated as the source vocabulary.

[0038] At block 304, the processor 112 may implement the instructions'^ to preserve content metadata 146a-146n of the plurality of content 142a-142n and source metadata 144a-144n of a plurality of sources 140a-140n from which the plurality of content 142a-142n was accessed. That is, for instance, the processor 112 may collect and store the source metadata 144a-144n and the content metadata 146a-146n in the data store 114. According to an example, the processor 112 may use source metadata 144a-144n to create a source profile for each of the sources 140a-140n. A source profile may describe various characteristics of a source 140a-140n such as maximum document length, average volume of documents per time interval, general intention of the content in the documents (e.g., objective content, subjective content news content, opinion content, etc.), etc.

[0039] According to an example, the processor 112 may use the content metadata 146a-146n to create a content profile for each content 142a-142n in each of the sources 140a-140n. A content profile may describe various characteristics of a content, e.g., document, such as title, author-designated keywords (e.g., hashtags), author, comments, user voting results (e.g., up votes, down votes, etc.), etc.

[0040] At block 306, the processor 112 may execute the instructions 124 to use the content metadata 146a-146n to calculate a plurality of word topic probabilities for the plurality of observed words. The processor 112 may determine topic probabilities for a word occurring in all sources 140a-140n. Specifically, the processor 112 may determine the probability that a topic /c will be assigned to a content 142a that includes a word v, where the particular source 140a of the content 142a is considered. The probability distribution for the word may be generated based on the word in a universal context that applies to all the sources 140a-140n. In other words, the processor 112 may determine the probability of a topic k being assigned to this occurrence of word v while taking the global topic distribution of v into account by weighing the importance of this occurrence of word v in source s by the importance of the source s with respect to word where

is the weight

associated with source s in the global topic distribution of word is the probability

of topic k occurring for word v in source s, and is the global topic distribution of

word v.

[0041] At block 308, the processor 112 may execute the instructions124 to use the source metadata 144a-144n to calculate a plurality of source topic probabilities for the plurality of observed words. For instance, the processor 112 may determine topic probabilities for a word occurring in a source. Specifically, the processor 112 may determine the probability a topic will be assigned to a content 142a that includes a word and is from a given source 140a. The probability for the source 140a may be generated based on the word and a source profile of the given source 140a.

[0042] At block 310, the processor 112 may implement the instructions124 to use a modeling technique to determine a latent topic for one of plurality of contents 142a-142n based on the plurality of observed words, the plurality of word topic probabilities, and the plurality of source topic probabilities. The modeling technique may be the Discriminative Dirichlet Allocation (DDA) modeling technique. The processor 112 may apply the DDA modeling technique to the content 142a-142n and associated content metadata 146a-146n to determine the latent topic.

[0043] According to an example, the processor 112 may perform DDA modeling to maximize the probability of topics assigned to the content 142a-142n of the sources 140a-140n. Specifically, the processor 112 may generate a probability distribution that includes maximized probabilities for topics that account for document topic probabilities, source topic probabilities, and word topic probabilities. In other words, the maximized topic probabilities may account for the relevant document profile and source profile, which ensures that the source and content specific characteristics are preserved. For example, if a prolific source has a high average volume of documents per time interval, the probabilities calculated for the prolific source may account for the higher probability that documents from the prolific source are noise (i.e., the probabilities that a topic is assigned to a document from the prolific source are decreased because it is more likely that the document is noise). In another example, if a news source is designated as having news content, the probabilities calculated for the news source may account for the higher probability that documents from the prolific source are relevant (i.e., the probabilities that a topic is assigned to a document from the news source are increased because it is more likely that the document is related to the topic).

[0044] The processor 112 may adjust probabilities in the generated probability distribution. Specifically, processor 112 may adjust the probabilities based on a second criterion (the first criterion being maximizing the probabilities) that controls the number of words assigned to each topic. In other words, the second criterion balances the number of words between topics. For example, the maximized bicriterion may be expressed as the following:

In Equation

is the probability of the topic associated with the word in the document of source s. To control the number of words assigned to each topic and to adjust for biases (e.g., volume bias) introduced due to the variability of sources, for each word drawn from Multinomial^) and is weighed using

is chosen from the Dirichlet Distribution of f in the denominator, where τ is the K-dimensional vector where the parameter for the Dirichlet prior on the

number of words per topic distribution. [0045] In this example, the relative weights given to the local and global topic distribution of every word are determined by€ (0 < e < 1). Using collapsed Gibbs sampling for inference, the following can be shown:

In this example, is the topic of the n^th word in the m^th document in source s, k is

a topic, W is a number of words in a vocabulary, a is a K-dimensional vector where the parameter for the Dirichlet prior on the per-document topic distributions

on the per-topic word distributions, χ is a K-dimensional vector where

the parameter for the Dirichlet prior on the per-word topic distributions, τ is the K-dimensional vector where the parameter for the Dirichlet prior on the

number of words per topic distribution, n is the topic distribution of the collection of documents, a_k is a selection for topic k from the Dirichlet prior on the per-document topic distributions on the per-word distribution, and x_k is a selection for topic k from the Dirichlet prior on the per-word topic distributions.

[0046] In Equation (3), three potential scenarios may be applied to the Gibbs sampling. First, the probability of a topic given word In this

case for a single source, and are chosen as discussed

above, this scenario reduces to the LDA algorithm. Further, for

and for multiple sources, the result of this scenario would reduce to LDA applied on a collection created by pooling all documents of all sources. Second, the probability of topic given word

is the number of sources, and N is the number of words in document D in source s.

Third, the probability of a topic given word

wheretf

is the number of sources, is the number of

words in document D in source s, K is the number of topics, and n is the topic distribution of the collection of documents.

[0047] Turning now to FIG. 4 is a flow diagram of a method 400 for integrating and extracting topics from content of heterogeneous sources, according to an example. The method 400 may therefore the equivalent to block 204 in the method 200 depicted in FIG. 2. Although execution of method 400 is described below with reference to the computing device apparatus 110 depicted in FIG. 1 , other suitable devices for execution of method 400 may be used.

[0048] The method 400 may start at block 402 and proceed to block 405, where the processor 1 12 may select next source s in a set of sources. In other words, the processor 112 may iteratively process each of the various sources 140a-140n that the processor 112 is to analyze. At block 410, the processor 112 may select the next document d from a collection of documents of source s. In other words, the processor 112 may iteratively process each of the documents of the source s selected in block 405.

[0049] At block 415, a topic distribution Θ may be selected for document d from the Dirichlet distribution of a. a is a K-dimensional (i.e., number of topics) Dirichlet distribution describing the topic probabilities for the given document d. At block 420, a local topic distribution <p may be selected for word w in source s from the Dirichlet distribution of χ. x ^'\s a V-dimensional (i.e., number of words in global vocabulary), Dirichlet distribution describing the topic probabilities for the given word w. In block 325, a global topic distribution is selected for word from the Dirichlet distribution

of

[0050] At block 430, the processor 112 may select the next word ω/_; in the source vocabulary for current source s. In other words, the processor 112 may iteratively process each of the words in the current source s. At block 435, a topic for the word in the i^tft document of source s

may be selected based on a product multinomial weighted by ω and e.

[0051] At block 440, the processor 112 may determine if the probability of topics assigned to the selected word are maximized (i.e., convergence) for the specified features. In this example, the specified features considered are (1) probabilities of a topic given a document (0), (2) probability of a topic given a word occurrence in a source ( φ), and (3) probability of a topic given a word occurrence in all sources

. If the probability of topics is not maximized, method 400 may return to block 435 to select a new topic.

[0052] If the probability of topics is maximized, method 400 may proceed to block 443, where the processor 112 may determine if all the words in the current document have been processed. If there are more words to process, method 400 may return to block 430 to process the next word. If there are no more words to process, method 400 may proceed to block 445, where the processor 112 may determine if there are more documents to process in the selected source s. If there are more documents to process, method 400 may return to block 410 to process the next document. If there are no more documents to process, method 400 may proceed to block 450, where the processor 112 may determine if there are more sources to process. If there are more sources to process, method 400 may return to block 405 to process the next source. If there are no more sources to process, method 400 may proceed to block 455, where method 400 may stop. [0053] FIGS. 5A and 5B are flow diagrams of a method 500 for generating a topic model for integrating and extracting topics from content of heterogeneous sources, according to an example. The method 500 may therefore the equivalent to block 204 in the method 200 depicted in FIG. 2. Although execution of method 500 is described below with reference to computing apparatus 110 of FIG. 1, other suitable devices for execution of method 500 may be used. Method 500 may be implemented in the form of executable instructions stored on a machine-readable storage medium and/or in the form of electronic circuitry.

[0054] In FIG. 5A, method 500 may start at block 502 and proceed to block 504, where the processor 112 may retrieve a number (S) of sources. For example, metadata for various content sources may be pre-configured by a user, which may then be retrieved using the authorizations or credentials provided by the user. At block 506, a number of documents is retrieved for each of the sources. The documents

may be obtained from each of the sources using the pre-configured information provided by the user. The documents retrieved may also include metadata describing characteristics of the documents.

[0055] At block 508, the processor 112 may determine relevant source and document characteristics for calculating topic probabilities. For example, source profiles and document profiles may be generated based on the metadata obtained at blocks 504 and 506. At block 510, a global vocabulary may be defined as the collection of observed words from all documents of all sources, and a source vocabulary may be defined for each of the sources as the collection of observed words in each of the sources. The observed words may be extracted from the documents obtained at block 506 and then normalized.

[0056] At block 512, the processor 112 may construct a model for latent topics based on the dependence of the latent topics on the observed words for the collection of documents. For example, the probability of topics assigned to each of the observed words may be maximized as discussed above and then used to construct the model. Assuming that the source distribution y is generated from a Dirichlet distribution with parameter fi, the total probability of the model may be expressed as shown below in equation (4):

In this example, ø is the S-dimensional vector where is the word

distribution in source s,

f is the probability distribution on sources,

is the V-dimensional vector where the parameter for the Dirichlet prior on the

per-topic word distributions, p is the Dirichlet prior on source distribution, is the

word distribution of topic / in source s, is the source generating the topic

distribution in the , word in the document in source s is the x^th word in the

document in source s, and the remaining variables are as described above with respect to equations (1 Η³)· At block 514, the processor 1 12 may extract the latent topics from the model via Gibbs sampling. The Gibbs sampling may be performed as discussed below.

[0057] Applying collapsed Gibbs sampling to Equation (4), Equation (5) below may be shown as:

In Equation is the number of words that are generated using word

distributions of topics from source y, excluding the word in the n^th position of the m^th document in source S, and the remaining variables are as described above with respect to Equations (1)-(4).

[0058] In FIG. 5B, method 514 may start at block 520 and proceed to block 522, where the processor 112 may select the next source in the collection of sources. At block 524, the processor 112 may select the next document in the selected source. In other words, the documents in each of the sources may be iteratively processed.

[0059] At block 526, the processor 112 may sample topic proportions from the selected document of the selected source. At block 528, the processor 112 may sample word distributions from the selected document of the selected source. At block 530, the processor 112 may determine if all the documents have been selected. If all the documents have not been selected, method 514 may return to block 524 to select the next document. At block 532, the processor 112 may determine if all the sources have been selected. If all the sources have not been selected, method 514 may return to block 526 to select the next source.

[0060] At block 534, the processor 112 may sample topic distributions for each topic. At block 536, the processor 112 may determine if a bicriterion is maximized, where the bicriterion maximizes the topic probabilities assigned to words while balancing the size distributions of topics across the words (i.e., ensuring that certain topics are not over-assigned). For example, the size distributions may be balanced using dynamic thresholds that are used to monitor the number of topics assigned to each word. If the bicriterion is not maximized, the processor 112 may reset the selected sources at block 528 and then return to block 522 to repeat the sampling process. In this case, the sampling parameters may be adjusted based on the topic distribution. If the bicriterion is maximized, method 514 may proceed to block 540, where method 514 may stop.

[0061] The DDA modeling technique discussed above is naturally attuned to handling heterogeneous sources. After the weighing of sources is configured (e.g., whether the importance of the sources would be proportional to the volume of their content or not), the DDA modeling technique may be parameterized by E, which depends upon how much average divergence between the local and global topics is acceptable. It is observed that the average divergence is small and relatively stable

[0062] Over a wide range of heterogeneous sources including social media, news sources, blogs, document repositories, online retailer sites, emails, discussion forums, and radio transcripts, the DDA modeling technique provides superior performance in integrating heterogeneous web sources, which allows for efficient data-compression. Further, topic-topic correspondence between different topics may be achieved. DDA topics are separated more proficiently, which provides for a higher probability that each latent topic detected addresses a different semantic topic. The DDA modeling technique may be used in applications such as summarization of documents, identification of correlated documents in heterogeneous sources, topical categorization of documents, and social search.

[0063] The DDA modeling technique is a general model in which additional features may be easily and naturally added. In general, the bi-criterion that DDA maximizes is as shown below in Equation (6):

In this case, F is the set of features,

is the

feature, and Z is the topic assignment. The features considered in the examples above are the document based topic distribution and word based topic distribution; however, other features such as temporal topic distribution (i.e., probability of the observed topic assignment given time), or labels may be added.

[0064] FIG.6 is a diagram of an example plate notation 600 of parameters used to integrate and extract topics from content of heterogeneous sources, according to an example. The plate notation 600 includes a sources plate 602 and a global vocabulary plate 604. A plate is a grouping of parameters as described below. The sources plate 602 further includes a source vocabulary plate 610 and a documents plate 606, which further includes a words plate 608. Each plate groups parameters into a subgraph that repeats, where the number shown on the plate represents the number of repetitions of the subgraph in the plate. In this case, the parameters in the subgraph are indexed by the number (e.g. S, M, N, V), and any relationships between parameters that cross a plate boundary are repeated for each repetition. The plate notation 600 also includes two parameters that occur outside the bounds of all plates, weights (e) 612 and global Dirichlet prior parameter for the global per-word topic distributions.

[0065] The sources plate 602 includes a parameter 616 for the Dirichlet prior on per-document topic distributions and a Dirichlet prior parameter (X) 618 for the per-word topic distributions. The documents plate 606 includes a document topic probability parameter (6) 622. The words plate 608 Includes a topic parameter (Zw) 624. The source vocabulary plate 610 includes a word probability parameter 620. The global vocabulary plate 604 includes a source weight parameter 626 and a global word probability parameter 628. Dirichlet prior parameters may be Dirichlet distributions that are used as prior distributions in Bayesian inference. For example, a Dirichlet prior parameter may be a symmetric Dirichlet distribution, where all parameters are equal. In this example, no component is favored because there is no prior information.

[0066] With reference now to FIG. 7, there is shown a flow diagram of a method 700 for extracting summaries from the contextual topics, according to an example. The processor 112 may execute the instructions 126 to implement the method 700 following aggregation of the content into contextual topics at block 204 in FIG. 2. The method 700 may therefore be equivalent to block 206 in the method 200 depicted in FIG. 2.

[0067] At block 702, the processor 112 may compute, for each content 142a-142n, a respective perplexity score for each sentence contained in the content 142a-142n with respect to the contextual topics. For instance, the processor 112 may compute a perplexity score for each sentence contained in a first content 142a with respect to a first contextual topic determined at block 204 (FIG. 2). The processor 112 may also compute a perplexity score for each sentence contained in the first content 142a with respect to a second contextual topic determined at block 204 as well as with respect to additional contextual topics determined at block 204. In this regard, the processor 112 may compute a respective perplexity score for each sentence contained in each content 142a-142n with respect to each of the contextual topics determined at block 204. A perplexity score for a sentence with respect to a contextual topic may be a measure of a likelihood that the sentence is or is not relevant to the contextual topic.

[0068] The perplexity score of a sentence d may be given by the exponential of the log likelihood normalized by the number of words in a sentence.

where N_d is the number of words in sentence d. Because sentences with fewer words may tend to have a higher inferred probability and hence a lower perplexity score, N_d may be normalized to favor sentences with more words.

[0069] Using the contextual topics aggregated at block 204 as discussed above, learned from the set of relevant sentences D_e, a representative sentence from each content with respect to a contextual topic may be determined to summarize the contextual topic. To determine the most representative sentence d in a content e for a topic z, the perplexity score may be computed with respect to the topic z for sentence d∈ D_eand a sentence d with the lowest perplexity score with respect to the topic z may be chosen to use in a summarization of the topic z. For example:

[0070] In this regard, at block 704, the processor 112 may, for each content, determine a relevance of each sentence contained in the content to each of the contextual topics based on the perplexity score of the sentence with respect to a respective contextual topic. The processor 112 may determine the relevance of each of the sentences to each of the contextual topics based upon the respective perplexity scores of the sentences with respect to the contextual topics.

[0071] At block 706, the processor 112 may, for each content, extract the most relevant sentence for each contextual topic to which the content is considered to be relevant. Thus, for instance, the processor 112 may extract the sentence in a content that is considered to be relevant to a first contextual topic having the lowest perplexity score with respect to the first contextual topic and so forth.

[0072] At block 708, the processor 112 may, for each contextual topic, construct a summary of the contextual topic to include the extracted sentences having the lowest perplexity scores with respect to the contextual topic. The sentences used to construct the summary of a contextual topic may be the sentences having the lowest perplexity scores from multiple ones of the content 142a-142n. In addition, the number of sentences that are used to construct the summary may be selected based upon any of a number of criteria. For instance, the number of sentences may be user-defined, predefined to a predetermined number, etc. A respective summary for each of the contextual topics may thus be constructed through performance of the method 700.

[0073] Turning now to FIG.8, there is shown a flow diagram of a method 800 for creating a query-based summary for each content identified during performance of a query, according to an example. The processor 112 may execute the instructions 132 to implement the method 800 following performance of a contextual search to identify content relevant to a query at block 210 in FIG. 2. The method 800 may therefore be equivalent to block 212 in the method 200 depicted in FIG. 2.

[0074] At block 802, the processor 112 may calculate perplexity scores for each of the sentences contained in the identified content. Particularly, the processor 112 may calculate perplexity scores for each of the sentences in the identified content with respect to the query performed at block 210 (FIG. 2). In other words, the perplexity scores for each of the sentences may measure a likelihood that the sentence is either relevant to or not relevant to the query. Similarly to the discussion above with respect to Equation (7), the perplexity score of a sentence d may be given by the exponential of the log likelihood normalized by the number of words in a sentence.

[0075] Using the query received at block 208 as discussed above, learned from the set of relevant sentences D_e, for instance, the sentences contained in the content identified as being relevant to the query, may be determined. To determine the most representative sentence d in a content e to a query z, the perplexity score may be computed with respect to the query z for sentence d e D_e. For example:

[0076] At block 804, the processor 112 may extract first representative sentences of the identified content, in which the first representative sentences are sentences that are relevant to the query as indicated by the calculated perplexity scores. For instance, the processor 112 may extract the sentences having perplexity scores that are below a predetermined threshold value, which may be determined through testing, user-defined, or the like.

[0077] At block 806, the processor 112 may order the extracted first representative sentences according to their calculated perplexity scores. That is, the processor 112 may sort the extracted first representative sentences according to the perplexity scores calculated for each of the extracted first representative sentences. By way of example, the processor 112 may sort the extracted first representative sentences in ascending order such that the first representative sentences having the lowest perplexity scores are at the top of the order and the first representative sentences having the highest perplexity scores are at the bottom of the order.

[0078] At block 808, the processor 112 may extract second representative sentences of the identified content, in which the second representative sentences are sentences containing the query keyword. In other words, the processor 112 may identify which of the sentences in the identified content contain the query keyword and may extract those sentences. The processor 112 may also determine the number of times, e.g., the counts, the query keyword appears in each of the respective second representative sentences.

[0079] At block 810, the processor 112 may order the extracted second representative sentences according to counts of the query keyword in the second representative sentences. That is, the processor 112 may sort the extracted second representative sentences according to the counts of the query keyword contained in each of the extracted second representative sentences. By way of example, the processor 112 may sort the extracted second representative sentences in descending order such that the second representative sentences having the highest counts of the query keyword are at the top of the order and the second representative sentences having the lowest counts of the query keyword are at the bottom of the order.

[0080] At block 812, the processor 112 may merge the first representative sentences with the second representative sentences using weights to obtain a ranked list of summary sentences. That is, for instance, the processor 112 may merge the first representative sentences at or near the top of the ordered list of first representative sentences with the second representative sentences at or near the top of the ordered list of second representative sentences. For instance, the processor 112 may merge the top three first representative sentences with the top three second representative sentences. If there is an overlap in sentences, the processor 112 may simply include one of the sentences, i.e., prevent duplicates. In addition, the processor 112 may apply weights, which may be user-defined, to one of the first representative sentences and the second representative sentences such that one of the first representative sentences and the second representative sentences are higher in the ranked list of summary sentences.

[0081] At block 814, the processor 112 may create the query-based summary for each of the identified content to be a predetermined number of the summary sentences at the top of the ranked list of summary sentences. The predetermined number of the summary sentences may be user-defined, preselected based upon a desired criteria, based upon testing, or the like.

[0082] Some or all of the operations set forth in the methods 200-800 may be contained as utilities, programs, or subprograms, in any desired computer accessible medium. In addition, the methods 200-800 may be embodied by computer programs, which may exist in a variety of forms both active and inactive. For example, they may exist as machine readable instructions, including source code, object code, executable code or other formats. Any of the above may be embodied on a non-transitory computer readable storage medium.

[0083] Examples of non-transitory computer readable storage media include computer system RAM, ROM, EPROM, EEPROM, and magnetic or optical disks or tapes. It is therefore to be understood that any electronic device capable of executing the above-described functions may perform those functions enumerated above.

[0084] Turning now to FIG. 9, there is shown a schematic representation of a computing apparatus 900, which may be equivalent to the computing apparatus 110 depicted in FIG. 1, according to an example. The computing apparatus 900 may include a processor 902, a display 904; an interface 908, which may be equivalent to the input/output interface 116; and a computer-readable medium 910, which may be equivalent to the machine-readable medium 120. Each of these components may be operatively coupled to a bus 912. For example, the bus 912 may be an EISA, a PCI, a USB, a FireWire, a NuBus, or a PDS. [0085] The computer readable medium 910 may be any suitable medium that participates in providing instructions to the processor 902 for execution. For example, the computer readable medium 910 may be non-volatile media, such as an optical or a magnetic disk; volatile media, such as memory. The computer-readable medium 910 may also store media scanning control machine readable instructions 914, which, when executed may cause the processor 902 to perform some or all of the methods 200-800 depicted in FIGS. 2-8.

[0086] Although described specifically throughout the entirety of the instant disclosure, representative examples of the present disclosure have utility over a wide range of applications, and the above discussion is not intended and should not be construed to be limiting, but is offered as an illustrative discussion of aspects of the disclosure.

[0087] What has been described and illustrated herein is an example of the disclosure along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the disclosure, which is intended to be defined by the following claims - and their equivalents - in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

Claims

What is claimed is:

1. A method for managing content storage and retrieval, said method comprising: accessing a plurality of content, wherein the plurality of content is of heterogeneous content types;

aggregating, by a processor, the plurality of content into a plurality of contextual topics;

receiving, by the processor, a query containing a keyword;

performing, by the processor, a contextual search on the plurality of contextual topics based upon the query to identify content relevant to the query;

creating, by the processor, a query-based summary for each of the identified content, wherein the query-based summary is a merging of sentences ordered according to their perplexities with respect to the contextual topic to which the identified content containing the sentences are aggregated and sentences ordered according to counts of the query keyword contained in the sentences; and

outputting, by the processor, the identified content and the query-based summary.

2. The method according to claim 1 , wherein aggregating the plurality of content into a plurality of contextual topics further comprises:

identifying a plurality of observed words in the accessed plurality of content; preserving content metadata of the plurality of content and source metadata of a plurality of sources from which the plurality of content was accessed;

using the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;

using the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and

using a modeling technique to determine a latent topic for one of plurality of contents based on the plurality of observed words, the plurality of word topic probabilities, and the plurality of source topic probabilities.

3. The method according to claim 2, wherein using the modeling technique further comprises using the Discriminative Dirichlet Allocation (DDA) modeling technique to determine the latent topic.

4. The method according to claim 2, further comprising:

using a global vocabulary and a global Dirichlet prior parameter to determine a plurality of global word topic probabilities, wherein the latent topic is further based on the plurality of global word topic probabilities.

5. The method according to claim 1 , wherein creating the query-based summary further comprises creating the query-based summary to include representative sentences extracted from the identified content that are relevant to both the topic of the identified content and the query.

6. The method according to claim 1 , further comprising:

extracting respective summaries from the plurality of contextual topics by: for each content of the plurality of content,

computing a respective perplexity score for each sentence contained in the content with respect to the plurality of contextual topics;

determining a relevance of each sentence contained in the content to the each of the contextual topics;

extracting the most relevant sentence for each contextual topic to which the content is considered to be relevant; and

for each contextual topic of the plurality of contextual topics, constructing a summary of the contextual topic to include the extracted sentences having the lowest perplexity scores with respect to the contextual topic.

7. The method according to claim 6, wherein the perplexity score for a sentence with respect to a contextual topic is a measure of a likelihood that the sentence is or is not relevant to the contextual topic.

8. The method according to claim 1 , wherein creating the query-based summary further comprises:

calculating perplexity scores for each of the sentences contained in the identified content;

extracting first representative sentences of the identified content, wherein the first representative sentences are sentences that are relevant to the query as indicated by the calculated perplexity scores;

ordering the extracted first representative sentences according to the calculated perplexity scores;

extracting second representative sentences of the identified content wherein the second representative sentences are sentences containing the query keyword; ordering the extracted second representative sentences according to counts of the query keyword in the second representative sentences;

merging the first representative sentences with the second representative sentences using weights to obtain a ranked list of summary sentences; and

creating the query-based summary for each of the identified content to be a predetermined number of the summary sentences at the top of the ranked list of summary sentences.

9. A computing apparatus comprising:

a processor; and

a machine-readable medium on which is stored machine readable instructions that are to cause the processor to:

aggregate a plurality of content into a plurality of contextual topics; receive a query containing a keyword;

perform a contextual search on the plurality of contextual topics based upon the query to identify content relevant to the query;

create a query-based summary for each of the identified content, wherein the query-based summary is a merging of a first set of sentences ordered according to their perplexities with respect to the contextual topic to which the identified content containing the first set of sentences are aggregated and a second set of sentences ordered according to counts of the query keyword contained in the second set of sentences; and

output the identified content and the query-based summary.

10. The computing apparatus according to claim 9, wherein the machine readable instructions are further to cause the processor to:

extract respective summaries from the plurality of contextual topics by:

for each content of the plurality of content,

compute a respective perplexity score for each sentence contained in the content with respect to the plurality of contextual topics;

determine a relevance of each sentence contained in the content to the each of the contextual topics;

extract the most relevant sentence for each contextual topic to which the content is considered to be relevant; and

for each contextual topic of the plurality of contextual topics, construct a summary of the contextual topic to include the extracted sentences having the lowest perplexity scores with respect to the contextual topic.

11. The computing apparatus according to claim 10, wherein the perplexity score for a sentence with respect to a contextual topic is a measure of a likelihood that the sentence is or is not relevant to the contextual topic.

12. The computing apparatus according to claim 10, wherein, to create the query-based summary, the machine readable instructions are further to cause the processor to create the query-based summary to include representative sentences extracted from the identified content that are relevant to both the topic of the identified content and the query.

13. The computing apparatus according to claim 9, wherein, to create the query-based summary, the machine readable instructions are further to cause the processor to:

calculate perplexity scores for each of the sentences contained in the identified content;

extract first representative sentences of the identified content, wherein the first representative sentences are sentences that are relevant to the query as indicated by the calculated perplexity scores;

order the extracted first representative sentences according to the calculated perplexity scores;

extract second representative sentences of the identified content, wherein the second representative sentences are sentences containing the query keyword;

order the extracted second representative sentences according to counts of the query keyword in the second representative sentences;

merge the first representative sentences with the second representative sentences using weights to obtain a ranked list of summary sentences; and

create the query-based summary for each of the identified content to be a predetermined number of the summary sentences at the top of the ranked list of summary sentences.

14. A non-transitory computer readable storage medium on which is stored machine readable instructions that when executed by a processor, cause the processor to: aggregate a plurality of content into a plurality of contextual topics;

receive a query containing a keyword;

output the identified content and the query-based summary.

15. The non-transitory computer readable storage medium according to claim 14, wherein, to create the query-based summary, the machine readable instructions are further to cause the processor to: