CN111966813B

CN111966813B - An information mining method and device and an information recommendation method and device

Info

Publication number: CN111966813B
Application number: CN201910417882.4A
Authority: CN
Inventors: 蒋卓人; 蓝金炯; 高正; 杨红霞
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-05-20
Filing date: 2019-05-20
Publication date: 2024-12-24
Anticipated expiration: 2039-05-20
Also published as: CN111966813A

Abstract

The present application discloses an information mining method and device and an information recommendation method and device. On the one hand, the present application simplifies large-scale heterogeneous graphs based on edge activation vectors, so that the scale of heterogeneous graphs is greatly compressed, overcoming the problems of low efficiency of running directly on heterogeneous graphs due to the large scale of heterogeneous graphs, the irrationality of manually specifying edge weights during the conversion of heterogeneous graphs to homogeneous graphs, and the introduction of invalid edges. On the other hand, since the user representation directly obtained from the heterogeneous graph integrates a variety of behavioral data information, more accurate information recommendation is guaranteed, thereby improving the user experience.

Description

Information mining method and device and information recommending method and device

Technical Field

The present application relates to, but is not limited to, data mining technologies, and in particular to an information mining method and device and an information recommendation method and device.

Background

A personalized recommendation algorithm constructs a user portrait by mining the user's behavior data in a service scenario, and recommends related content to the user according to that portrait.

The constructed user portrait can be comprehensive and accurate only when the user has rich behavior data in the corresponding scenario; when the user's behavior in the scenario is sparse, the effect of the personalized recommendation algorithm is greatly reduced. Take as an example an application scenario in which a user has no reading behavior in an APP: the usual strategy in this case is for the APP to recommend popular content unrelated to the user. Such content is not necessarily of interest to the user, so the user's experience in that scenario is greatly degraded.

In addition to behavior data occurring in a specific scene, there are various behavior data such as commodity behavior data and the like in a system of a large product. These behavioral data characterize the user's point of interest from different perspectives, and if the behavioral data from multiple sources within the product can be integrated to supplement the single behavioral data in a particular scenario, a deeper capture of the user's preferences can be produced.

When behavior data from multiple sources are mined in an integrated way, the relationships among the data are complex and varied, so it is well suited to represent the behavior data by constructing a heterogeneous graph and to mine the constructed heterogeneous graph with a related graph embedding algorithm. In this case, the user representation obtained from the heterogeneous graph fuses multiple kinds of behavior data, laying a good foundation for subsequently recommending information to the user more accurately. Here, a heterogeneous graph is a type of graph that has more than one type of node or edge; correspondingly, a homogeneous graph is a type of graph that has only one node type and one edge type. Graph embedding is a graph algorithm that represents each node of a graph as a multidimensional vector.

In the related art, methods for performing graph embedding learning on heterogeneous graphs include the following:

One approach first converts the heterogeneous graph into a homogeneous graph: different weights are assigned to the different types of edges on the heterogeneous graph, and the different types of edges between two nodes are normalized into a single weight by weighted summation. Learning then proceeds with a homogeneous graph embedding algorithm such as DeepWalk, which learns a social representation of a graph's vertices and, as a refinement of the random walk, uses a stream of short random walks. In this approach the weights of the different edge types are specified manually, which is unreasonable; moreover, some edges are invalid for the recommendation task of a specific scenario but are still retained, which introduces noise and directly degrades the application effect.

The other approach is a traditional heterogeneous graph embedding algorithm, which either requires manual prior knowledge to formulate a path cluster that guides the random walk, or faces the challenge of walking efficiently on large-scale heterogeneous graphs.
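The weighted-summation conversion described for the related art can be illustrated with a minimal sketch. The edge types, node names and weights below are hypothetical values chosen for illustration; the point is only that the per-type weights must be supplied by hand and that every edge type contributes, whether or not it is useful for the task.

```python
from collections import defaultdict

def to_homogeneous(edges, type_weights):
    """Collapse typed edges into one weight per node pair by weighted summation.

    edges: list of (source, target, edge_type) tuples.
    type_weights: manually specified weight for each edge type (hypothetical).
    """
    weights = defaultdict(float)
    for source, target, edge_type in edges:
        weights[(source, target)] += type_weights[edge_type]
    return dict(weights)

# Hypothetical toy data: two edge types connect u1 and a1, one connects u1 and i1.
edges = [("u1", "a1", "read"), ("u1", "a1", "like"), ("u1", "i1", "buy")]
type_weights = {"read": 1.0, "like": 2.0, "buy": 0.5}
homogeneous = to_homogeneous(edges, type_weights)
```

Because no edge type is ever dropped in this conversion, edges that are irrelevant to a given recommendation task still contribute to the collapsed weights, which is the source of noise noted above.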

Disclosure of Invention

The application provides an information mining method and device and an information recommending method and device, which can improve the efficiency of information processing and the user experience.

The embodiment of the invention provides an information mining method, which comprises the following steps:

acquiring an edge activation vector in the heterogeneous graph;

optimizing the obtained edge activation vector at least according to the association degree between two nodes in the heterogeneous graph to obtain the edge activation vector with task characteristics;

and learning a graph embedding representation with specific task characteristics based on the node sequence generated by the random walk.

In one illustrative example, the acquiring the edge activation vector in the heterogeneous graph includes:

Initializing k binary heterogeneous edge activation vectors according to the different types of edges of the heterogeneous graph, wherein the value of k is related to the complexity of the heterogeneous graph: the more complex the heterogeneous graph is and the more types of edges it has, the larger the value of k.

In one illustrative example, the degree of association between two nodes in the heterogeneous graph is obtained by a fitness function that evaluates the appropriateness of different edge activation vectors in the heterogeneous graph.

In an exemplary embodiment, the optimizing the obtained edge activation vector according to the fitness function includes:

Respectively calculating the fitness function values corresponding to p different edge activation vectors on a specific task by using the fitness function, wherein each edge activation vector comprises k numerical values;

Proportionally selecting p/2 edge activation vector pairs according to the fitness function value;

performing intra-pair feature cross processing on the selected p/2 edge activation vector pairs, and then performing gene mutation processing to generate p new edge activation vectors;

and returning to the step of calculating the fitness function value, continuing to execute until convergence, and selecting the edge activation vector with the optimal yield as the edge activation vector with the task characteristics.

In one illustrative example, a graph algorithm is used to obtain the degree of association between two nodes in the heterogeneous graph.

The present application also provides a computer-readable storage medium storing computer-executable instructions for performing the information mining method of any one of the above.

The application also provides a device for realizing information mining, which comprises a memory and a processor, wherein the memory stores instructions executable by the processor for executing the steps of the information mining method.

The application also provides an information recommendation method, which comprises the following steps:

Acquiring an edge activation vector in the heterogeneous graph, and optimizing the acquired edge activation vector at least according to the association degree between two nodes in the heterogeneous graph to obtain an edge activation vector with task characteristics;

Performing a random walk on the heterogeneous graph optimized by the edge activation vector to generate the node sequence required for graph embedding algorithm learning;

And constructing a user portrait according to the learned graph embedded representation of the specific task characteristics, and recommending information to the user according to the constructed user portrait.

In one exemplary embodiment, the optimizing the obtained edge activation vector at least according to the fitness function includes:

Calculating the fitness function values corresponding to p different edge activation vectors on a specific task by using the fitness function, wherein each edge activation vector comprises k numerical values, the value of k is related to the complexity degree of the heterogeneous graph, the more complex the heterogeneous graph is, the more the types of edges are, and the larger the k value is;

Performing intra-pair feature cross processing on the selected p/2 edge activation vector pairs, and then performing gene mutation processing to generate k new edge activation vectors;

The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the information recommendation method of any one of the above.

The application further provides a device for realizing information recommendation, which comprises a memory and a processor, wherein the memory stores instructions executable by the processor for executing the steps of the information recommendation method.

The application simplifies the large-scale heterogeneous graph based on the edge activation vector and greatly compresses its scale. It thereby overcomes the low operating efficiency caused by the excessive scale of the heterogeneous graph, the irrationality of manually specifying edge weights when converting the heterogeneous graph into a homogeneous graph, the introduction of invalid edges, and similar problems. That is, with the information mining method the running time is significantly reduced while better results are obtained than with related-art methods, thereby improving the user experience.

According to the information recommendation method, the user characterization directly obtained from the heterogeneous graph fuses various behavior data information, so that more accurate information recommendation is ensured, and user experience is improved.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and do not limit the application.

FIG. 1 is a flow chart of an information mining method of the present application;

FIG. 2 is a flow chart of an information recommendation method of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be arbitrarily combined with each other.

In one typical configuration of the application, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.

FIG. 1 is a flow chart of the information mining method of the present application, as shown in FIG. 1, comprising:

Step 100, obtaining an edge activation vector in the heterogeneous graph.

In one illustrative example, the present step may include:

k binary heterogeneous edge activation vectors are initialized according to the different types of edges of the heterogeneous graph. For example, if there are 20 different types of edges on the heterogeneous graph, then k 20-dimensional vectors can be constructed by initialization, where the value of each dimension of a vector is either 0 or 1 and is selected randomly at the beginning.

The value of k is a variable parameter related to the complexity of the heterogeneous graph: the more complex the heterogeneous graph is and the more types of edges it has, the larger the value of k.
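The initialization in step 100 can be sketched as follows, following the 20-edge-type example above. The function name, its parameters and the use of a fixed seed are assumptions made for illustration and are not taken from the application.

```python
import random

def init_edge_activation_vectors(num_vectors, num_edge_types, seed=None):
    """Return num_vectors random binary vectors with one bit per edge type.

    A bit of 1 marks the corresponding edge type as active, 0 as inactive.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(num_edge_types)]
            for _ in range(num_vectors)]

# Hypothetical sizes: 8 candidate vectors for a graph with 20 edge types.
vectors = init_edge_activation_vectors(num_vectors=8, num_edge_types=20, seed=0)
```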

Step 101, optimizing the obtained edge activation vector at least according to the association degree between two nodes in the heterogeneous graph to obtain the edge activation vector with task characteristics.

In one illustrative example, the degree of association between two nodes in the heterogeneous graph may be obtained by a fitness function that evaluates the appropriateness of different edge activation vectors in the heterogeneous graph.

In an illustrative example, the fitness function is also used to evaluate how appropriate different edge activation vectors are in the heterogeneous graph, and may be as shown in formula (1):

In formula (1), i and j represent two nodes on the heterogeneous graph, z represents a common neighbor of node i and node j in the heterogeneous graph, n and m represent specific types of edges, C is a candidate edge activation vector, E is a certain type of edge on the heterogeneous graph, ω(E) is the weight of edge type E, E_C(v) represents the current set of valid edges of node v given the edge activation vector C, and E^C_{i,j} represents the set of valid edges between node i and node j given the edge activation vector C.

The fitness function shown in formula (1) characterizes well how much effective information is retained between nodes. The fitness function values corresponding to p different edge activation vectors are calculated on a specific task, and p/2 edge activation vector pairs are selected proportionally according to those values: the higher the fitness, the higher the probability of being selected. Each edge activation vector includes k values.

In one illustrative example, optimizing the obtained edge activation vector according to the fitness function in this step includes:

respectively calculating fitness function values corresponding to p different edge activation vectors on specific tasks, wherein each edge activation vector comprises k numerical values;

proportionally selecting p/2 edge activation vector pairs according to the fitness function values;

performing intra-pair feature crossover on the selected p/2 edge activation vector pairs, and then performing gene mutation (a technical term in genetic algorithms), that is, inverting the value at a certain position of an activation vector with a certain probability, to generate p new edge activation vectors;

and returning to the step of calculating the fitness function values and continuing until convergence, for example until the number of repetitions reaches a set value, and then selecting the edge activation vector with the optimal yield (such as the edge activation vector with the maximum fitness function value) as the edge activation vector with task characteristics.
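The iterative optimization can be sketched as a small genetic-algorithm loop. This is only an illustration under stated assumptions: formula (1) is not reproduced here, so `toy_fitness` is a hypothetical stand-in that rewards activating a chosen set of "useful" edge types; the population size is assumed to be even; convergence is approximated by a fixed number of generations; and the best vector seen so far is tracked explicitly.

```python
import random

def toy_fitness(vector, useful_types):
    """Hypothetical stand-in for formula (1): +1 per useful active type, -1 otherwise."""
    return sum((1 if idx in useful_types else -1)
               for idx, bit in enumerate(vector) if bit)

def evolve(num_edge_types, population_size, generations, fitness,
           mutation_prob=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(num_edge_types)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(v) for v in population]
        # Shift scores so that every selection weight is strictly positive.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        next_population = []
        for _ in range(population_size // 2):
            # Proportional selection of one pair.
            a, b = rng.choices(population, weights=weights, k=2)
            a, b = a[:], b[:]
            # Intra-pair crossover: exchange the values at one random position.
            pos = rng.randrange(num_edge_types)
            a[pos], b[pos] = b[pos], a[pos]
            # Mutation: with a small probability, invert one random position.
            for child in (a, b):
                if rng.random() < mutation_prob:
                    flip = rng.randrange(num_edge_types)
                    child[flip] = 1 - child[flip]
                next_population.append(child)
        population = next_population
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best

useful = {0, 3, 5}  # hypothetical edge types assumed to help the task
score = lambda v: toy_fitness(v, useful)
best = evolve(num_edge_types=8, population_size=10, generations=30, fitness=score)
```

Each generation turns p/2 selected pairs into p new vectors, so the population size stays constant across iterations.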

In an exemplary embodiment, in the process of optimizing the obtained edge activation vector according to the fitness function, the specific task may be a scenario task of recommending related articles to a user; under this task, the user's operation data on articles (such as reading, liking and forwarding) may be used as the input of the algorithm.

In an exemplary embodiment, in the above process of optimizing the obtained edge activation vector according to the fitness function, proportional selection may include selecting edge activation vectors according to the fitness values calculated for them; for example, the higher the fitness, the greater the chance that an edge activation vector is selected.

In an exemplary embodiment, in the process of optimizing the obtained edge activation vector according to the fitness function, the certain probability may be a probability value set in advance by the user, and the certain position may be a randomly selected position. In an exemplary embodiment, the value at a certain position of the activation vectors is changed with a certain probability; that is, a position is selected at random, and the values of the two activation vectors at this position are replaced with a fixed probability.
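Proportional selection can be sketched with weighted sampling from the standard library. The sketch assumes non-negative fitness values and samples with replacement, so the same vector may appear twice in a pair; both choices are assumptions, as is the uniform fallback when all fitness values are zero.

```python
import random

def select_pairs(population, fitness_values, rng):
    """Draw len(population) // 2 pairs, favouring vectors with higher fitness."""
    total = sum(fitness_values)
    # Fall back to uniform sampling if every fitness value is zero.
    weights = fitness_values if total > 0 else [1] * len(population)
    pairs = []
    for _ in range(len(population) // 2):
        first, second = rng.choices(population, weights=weights, k=2)
        pairs.append((first, second))
    return pairs

# Hypothetical toy population in which only the third vector has non-zero fitness.
population = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
fitness_values = [0.0, 0.0, 5.0, 0.0]
pairs = select_pairs(population, fitness_values, random.Random(0))
```

With these weights only the third vector can ever be drawn, which makes the effect of fitness-proportional sampling easy to see.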
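The two operators can be sketched separately. Interpreting the intra-pair operation as exchanging the two vectors' values at one random position is an assumption drawn from the wording above; `swap_prob` and `flip_prob` are hypothetical parameter names for the preset probabilities.

```python
import random

def crossover(a, b, swap_prob, rng):
    """With probability swap_prob, exchange the values of a and b at one random position."""
    a, b = list(a), list(b)
    pos = rng.randrange(len(a))
    if rng.random() < swap_prob:
        a[pos], b[pos] = b[pos], a[pos]
    return a, b

def mutate(vector, flip_prob, rng):
    """With probability flip_prob, invert the bit at one random position."""
    vector = list(vector)
    if rng.random() < flip_prob:
        pos = rng.randrange(len(vector))
        vector[pos] = 1 - vector[pos]
    return vector

rng = random.Random(0)
child_a, child_b = crossover([0, 0, 0], [1, 1, 1], swap_prob=1.0, rng=rng)
mutated = mutate([0, 0, 0, 0], flip_prob=1.0, rng=rng)
```

Both functions copy their inputs, so the parent vectors remain unchanged and can be selected again in the same generation.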

In one illustrative example, the present step may include:

the degree of association between two nodes in the heterogeneous graph may be obtained using a graph algorithm in the related art, such as SimRank.

In an illustrative example, when the obtained edge activation vector is optimized according to a graph algorithm in the related art in this step, the genetic algorithm is no longer used to learn the edge activation vector; instead, the edge activation vector is treated as part of the training parameters of the graph algorithm and is learned directly by the graph algorithm.

The method uses the edge activation vector to simplify and compress the large-scale heterogeneous graph: edges and nodes that are invalid for the subsequent task are deleted, and the scale of the graph is reduced to different degrees depending on the activation vector. The simplified heterogeneous graph can then be fed into the related-art methods for graph embedding learning on heterogeneous graphs, improving operating efficiency while preserving the application effect.
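The simplification can be sketched as a filter over typed edges. The edge-list representation, the edge type names and the node names are hypothetical; the sketch assumes that a node is dropped once none of its edges remain active.

```python
def simplify_graph(edges, edge_types, activation):
    """Keep only edges whose type is switched on in the activation vector.

    edges: list of (source, target, edge_type) tuples.
    edge_types: ordered list of type names, aligned with the activation vector.
    activation: binary edge activation vector, one bit per edge type.
    Returns the surviving edges and the set of nodes they still touch.
    """
    active = {t for t, bit in zip(edge_types, activation) if bit == 1}
    kept_edges = [e for e in edges if e[2] in active]
    kept_nodes = {node for s, t, _ in kept_edges for node in (s, t)}
    return kept_edges, kept_nodes

# Hypothetical toy graph with three edge types; only "read" is activated.
edge_types = ["read", "like", "buy"]
edges = [("u1", "a1", "read"), ("u1", "a2", "like"),
         ("u2", "i1", "buy"), ("u2", "a1", "read")]
kept_edges, kept_nodes = simplify_graph(edges, edge_types, [1, 0, 0])
```

In this toy case two of the four edges and two of the five nodes disappear, which illustrates how a sparse activation vector shrinks both counts at once.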

Step 102, performing a random walk on the heterogeneous graph optimized by the edge activation vector to generate the node sequence required for graph embedding algorithm learning, and learning a graph embedding representation with specific task characteristics based on the node sequence generated by the random walk.

This step may be implemented with methods provided in the related art, and the specific implementation does not limit the protection scope of the present application. It is emphasized that the heterogeneous graph at this point is the heterogeneous graph that the present application has simplified using the edge activation vector.

The application simplifies the large-scale heterogeneous graph based on the edge activation vector and greatly compresses its scale. Practical application shows that the total number of edges in the heterogeneous graph can be reduced by about 76 percent and the total number of nodes by about 65 percent. At the same time, the application overcomes the low efficiency of running directly on the heterogeneous graph caused by its excessive scale, the irrationality of manually specifying edge weights when converting the heterogeneous graph into a homogeneous graph, the introduction of invalid edges, and similar problems. That is, with the information mining method the running time is significantly reduced while better results are obtained than with related-art methods, thereby improving the user experience.
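As one possible related-art realization, uniform random walks over the simplified graph can be sketched as below. Treating edges as undirected, choosing the next node uniformly, and the parameter names `walk_length` and `walks_per_node` are all assumptions; the resulting node sequences would then be passed to an embedding learner such as the skip-gram model used by DeepWalk, which is not shown.

```python
import random
from collections import defaultdict

def random_walks(edges, walk_length, walks_per_node, seed=0):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    neighbours = defaultdict(list)
    for source, target, _ in edges:
        # Edges are treated as undirected for walking (an assumption).
        neighbours[source].append(target)
        neighbours[target].append(source)
    walks = []
    for start in sorted(neighbours):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(neighbours[walk[-1]]))
            walks.append(walk)
    return walks

# Hypothetical simplified graph that kept only "read" edges.
simplified_edges = [("u1", "a1", "read"), ("u2", "a1", "read"), ("u2", "a2", "read")]
walks = random_walks(simplified_edges, walk_length=5, walks_per_node=2)
```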

The information mining method provided by the application can be used for realizing the subsequent sorting tasks based on the mined information, information recommendation and the like.

The application further provides a device for realizing information mining, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the computer program realizes the steps of any one of the information mining methods when being executed by the processor.

Fig. 2 is a flowchart of an information recommendation method according to the present application, as shown in fig. 2, including:

Step 200, obtaining an edge activation vector in the heterogeneous graph, and optimizing the obtained edge activation vector at least according to the association degree between two nodes in the heterogeneous graph to obtain the edge activation vector with task characteristics.

The specific implementation of this step is described in step 100 to step 101, and will not be described here again.

Step 201, performing a random walk on the heterogeneous graph optimized by the edge activation vector to generate the node sequence required for graph embedding algorithm learning, and learning a graph embedding representation with specific task characteristics based on the node sequence generated by the random walk.

Step 202, constructing a user portrait according to the learned graph embedding representation with specific task characteristics, and recommending information to the user according to the constructed user portrait.
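The application does not specify how the user portrait is derived from the embeddings or how candidates are ranked. One simple possibility, shown purely as an assumption, is to use the user node's embedding as the portrait and rank candidate articles by cosine similarity; all vectors and identifiers below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user_vector, item_vectors, top_n):
    """Return the ids of the top_n items most similar to the user vector."""
    ranked = sorted(item_vectors,
                    key=lambda item: cosine(user_vector, item_vectors[item]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical two-dimensional embeddings learned from the simplified graph.
item_vectors = {"a1": [1.0, 0.0], "a2": [0.0, 1.0], "a3": [0.9, 0.1]}
user_vector = [1.0, 0.0]
recommended = recommend(user_vector, item_vectors, top_n=2)
```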

The present application also provides a computer-readable storage medium storing computer-executable instructions for performing the information recommendation method of any one of the above.

The application further provides a device for realizing information recommendation, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the computer program realizes the steps of any information recommendation method when being executed by the processor.

Although the embodiments of the present application are described above, the embodiments are only used for facilitating understanding of the present application, and are not intended to limit the present application. Any person skilled in the art can make any modification and variation in form and detail without departing from the spirit and scope of the present disclosure, but the scope of the present disclosure is to be determined by the appended claims.

Claims

1. An information mining method, comprising:

acquiring an edge activation vector in the heterogeneous graph;

optimizing the obtained edge activation vector at least according to the association degree between two nodes in the heterogeneous graph to obtain an edge activation vector with task characteristics;

the edge activation vector is used for indicating vectors with the same dimension as the number of the types of the edges of the heterogeneous graph;

the optimization of the obtained edge activation vector at least according to the association degree between two nodes in the heterogeneous graph comprises the steps of performing feature cross processing and gene mutation processing on the edge activation vector according to the association degree between two nodes in the heterogeneous graph;

the specific task characteristics are used for indicating that article information is recommended to a user, and operation data of the user on the article information is used for generating the graph embedded representation.

2. The information mining method of claim 1, wherein the acquiring the edge activation vector in the heterogeneous graph comprises:

3. The information mining method of claim 1, wherein the degree of association between two nodes in the heterogeneous graph is obtained by a fitness function for evaluating the fitness of different edge activation vectors in the heterogeneous graph.

4. The information mining method of claim 3, wherein optimizing the obtained edge activation vector according to the fitness function comprises:

Performing the feature cross processing in pairs on the p/2 edge activation vector pairs selected, and then performing the gene mutation processing to generate p new edge activation vectors;

And returning to the step of calculating the fitness function value, continuing to execute until convergence, and selecting the edge activation vector with the optimal yield as the edge activation vector with the task characteristic.

5. The information mining method according to claim 2, wherein a graph algorithm is used to obtain the degree of association between two nodes in the heterogeneous graph.

6. A computer-readable storage medium storing computer-executable instructions for performing the information mining method of any one of claims 1-5.

7. An apparatus for implementing information mining comprising a memory and a processor, wherein the memory has stored therein instructions executable by the processor for performing the steps of the information mining method of any of claims 1-5.

8. An information recommendation method, comprising:

Acquiring an edge activation vector in a heterogeneous graph, and optimizing the acquired edge activation vector at least according to the association degree between two nodes in the heterogeneous graph to obtain an edge activation vector with task characteristics;

Constructing a user portrait according to the embedded representation of the learned specific task characteristics, and recommending information to a user according to the constructed user portrait;

9. The information recommendation method of claim 8, wherein the degree of association between two nodes in the heterogeneous graph is obtained by a fitness function for evaluating the fitness of different edge activation vectors in the heterogeneous graph.

10. The information recommendation method of claim 9, wherein optimizing the obtained edge activation vector according to at least the fitness function comprises:

performing the feature cross processing in pairs on the p/2 edge activation vector pairs selected, and then performing the gene mutation processing to generate k new edge activation vectors;

11. A computer-readable storage medium storing computer-executable instructions for performing the information recommendation method of any one of claims 8 to 10.

12. An apparatus for implementing information recommendation, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the information recommendation method of any of claims 8-10.