CN111538846A - Third-party library recommendation method based on hybrid collaborative filtering

Info

Publication number: CN111538846A
Application number: CN202010298379.4A
Authority: CN
Inventors: 李兵; 陈健; 王健; 赵玉琦; 姚力; 熊燚铭
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2020-08-14

Abstract

The invention discloses a third-party library recommendation method based on hybrid collaborative filtering, comprising the following steps: obtaining feature training data sets of applications and third-party libraries from published application and third-party library data; training a topic model with an unsupervised learning method; extracting entities from the application and third-party library data to construct a knowledge graph, and vectorizing the knowledge graph; inputting the data of the application to be recommended into the topic model to generate a neighbor list of applications; obtaining a content-based score list for the application to be recommended from the applications' call information for the third-party libraries; inputting the application to be recommended and the list of third-party libraries to be recommended into the knowledge graph to obtain a list of entity vectors; calculating the similarity of the entity vectors to obtain a knowledge-graph-based score list for the application to be recommended; and fusing the score lists and sorting the result to obtain a hybrid recommendation list. The hybrid recommendation method avoids the defects of any single recommendation method, effectively addresses the data sparsity and cold-start problems, and improves recommendation accuracy.

Description

Third-party library recommendation method based on hybrid collaborative filtering

Technical Field

The invention relates to computer technology, and in particular to a third-party library recommendation method based on hybrid collaborative filtering.

Background

With the rapid development of Internet technology, software and applications have become an indispensable part of daily life. Third-party libraries play a vital role in application development: they shorten development time, increase development efficiency, and improve development quality. However, as the number of third-party libraries grows rapidly, choosing a suitable one is time-consuming and laborious even for experienced developers. Selecting a library that meets the requirements from a large and complex set of third-party libraries has become a difficult problem in application development. Recommender systems can effectively alleviate information overload and are widely used in many online scenarios. Current third-party library recommendation methods are mainly content-based or collaborative-filtering-based, and they have the following problems:

1. A content-based recommendation method discovers the similarity between third-party libraries and applications from their content, and then recommends similar third-party libraries to an application based on the application's content. This method handles the cold-start and data sparsity problems effectively and is highly interpretable, but it cannot recommend third-party libraries that have only a potential calling relationship with the application, and feature selection, extraction, and matching for third-party libraries are difficult.

2. A collaborative-filtering-based recommendation method first finds applications similar to the application to be recommended, and then recommends to it the third-party libraries that those applications call but it does not. Because this way of clustering applications relies on the applications' historical third-party library call information, it suffers from data sparsity and cold-start problems.

Summary of the Invention

The technical problem to be solved by the present invention is to provide, in view of the defects in the prior art, a third-party library recommendation method based on hybrid collaborative filtering.

The technical solution adopted by the present invention to solve this technical problem is a third-party library recommendation method based on hybrid collaborative filtering, comprising the following steps:

1) obtaining published application and third-party library data from the application and third-party library servers, the data including text description information of the applications and third-party libraries, call information of the applications to the third-party libraries, and structured semantic information of the applications and third-party libraries themselves;

2) preprocessing and vectorizing the text description information of the applications and third-party libraries from step 1) with natural language processing methods, to obtain a content-based feature training data set of the applications and third-party libraries;

3) using the content-based feature training data set from step 2) as a corpus, and training with an unsupervised learning method to obtain a topic model of the applications and third-party libraries;

4) constructing, from the call information in step 1), a call interaction matrix of applications to third-party libraries according to the calling relationship of each application to each third-party library;

5) extracting entities and the relationships between entities from the structured semantic information of the applications and third-party libraries obtained in step 1), and saving them in a graph database to form a knowledge graph;

6) using a knowledge graph representation learning method to map the knowledge graph obtained in step 5) to a low-dimensional space, to obtain a vectorized representation of each entity and relationship;

7) obtaining the text description information of the application to be recommended and its own structured information;

8) inputting the text description information of the application to be recommended obtained in step 7) into the topic model obtained in step 3), and obtaining a content-based neighbor list of the application to be recommended by comparing the similarity of the text description information;

9) from the call interaction information of the applications in the content-based neighbor list with respect to the third-party libraries to be recommended, using a collaborative filtering method to compute a similarity-weighted sum and average, to obtain a content-based score list of the application to be recommended for the third-party libraries to be recommended;

10) inputting the structured information of the application to be recommended obtained in step 7) and the list of third-party libraries to be recommended into the knowledge graph, and finding the matching entities through an entity recognition method, to obtain the corresponding list of entity vectors generated by the knowledge graph representation learning method;

11) calculating the similarity between the entity vector representation of the application to be recommended and the entity vector representation of each third-party library to be recommended, as a knowledge-graph-based score list of the application to be recommended for the third-party libraries to be recommended;

12) fusing the content-and-collaborative-filtering-based score list obtained in step 9) with the knowledge-graph-based score list obtained in step 11), to obtain a hybrid-recommendation-based score list of the application to be recommended for the third-party libraries to be recommended;

13) sorting the scores of the application to be recommended for all third-party libraries to be recommended in descending order, to obtain a Top-N third-party library recommendation list for the application to be recommended based on hybrid recommendation.

According to the above solution, the preprocessing by natural language processing methods in step 2) includes word segmentation, removal of stop words, punctuation marks, and low-frequency words, and stemming.
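As a minimal sketch of this preprocessing, assuming English-language descriptions, the following Python applies the listed operations in order. The stop-word list, the `min_count` threshold, and the suffix-stripping helper `crude_stem` are illustrative assumptions rather than part of the method as described; a real pipeline would use a full stop-word list and an established stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "for", "of", "to", "in", "is", "with"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(docs, min_count=2):
    """Segment, drop stop words, punctuation and rare words, then stem."""
    # Lowercase and keep alphabetic runs only, which also discards punctuation.
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    tokenized = [[t for t in doc if t not in STOP_WORDS] for doc in tokenized]
    # Corpus-wide frequency; words seen fewer than min_count times are dropped.
    freq = Counter(t for doc in tokenized for t in doc)
    kept = [[t for t in doc if freq[t] >= min_count] for doc in tokenized]
    return [[crude_stem(t) for t in doc] for doc in kept]
```

The low-frequency filter is applied before stemming here to follow the order in which the operations are listed; filtering after stemming would be an equally reasonable choice.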

According to the above solution, the vectorization in step 2) uses a text vectorization method such as TF-IDF, Doc2Bow, or LDA.
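To make the TF-IDF option concrete, the sketch below computes one TF-IDF vector per preprocessed document using only the standard library. The function name `tfidf_vectors` and the smoothed IDF variant are assumptions chosen for illustration; the text does not specify which TF-IDF weighting is used.

```python
import math
from collections import Counter


def tfidf_vectors(token_docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    vocab = sorted({t for doc in token_docs for t in doc})
    n_docs = len(token_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in token_docs for t in set(doc))
    # Smoothed IDF (assumed variant): terms in every document keep weight 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in token_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors
```

Terms that appear in fewer documents receive a larger weight, so rare but descriptive words dominate the resulting vectors.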

According to the above solution, the call interaction matrix of applications to third-party libraries in step 4) is constructed as follows:

The call information of applications to third-party libraries obtained in step 1) is grouped by application ID to generate a preference vector for each application. The preference relationship is defined as follows: for an application a and a third-party library l, y_al = 1 if a calls l, and y_al = 0 otherwise. The preference vectors of M applications over N third-party libraries constitute the call interaction matrix of applications to third-party libraries.
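The construction of the M x N matrix can be sketched as follows. The function name `build_interaction_matrix` and the representation of call information as `(app_id, lib_id)` pairs are assumptions made for the example.

```python
def build_interaction_matrix(calls, apps, libs):
    """Build the M x N 0/1 matrix from (app_id, lib_id) call records.

    Entry [i][j] is 1 if application apps[i] calls library libs[j], else 0.
    """
    app_index = {a: i for i, a in enumerate(apps)}
    lib_index = {l: j for j, l in enumerate(libs)}
    matrix = [[0] * len(libs) for _ in apps]
    for app_id, lib_id in calls:
        matrix[app_index[app_id]][lib_index[lib_id]] = 1
    return matrix
```

Each row of the result is the preference vector of one application.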

According to the above solution, the content-based neighbor list of the application to be recommended in step 8) is obtained as follows:

8.1) the text description information of the application to be recommended is preprocessed by natural language processing methods and then converted into a vector by a text vectorization method;

8.2) the resulting vector is input into the topic model obtained in step 3), and through similarity comparison the k applications with the highest similarity are taken as the content-based neighbor list of the application to be recommended, N(a) = {a₁, a₂, …, a_k}, where the similarity sim(a, a_i) is calculated by one of the following: cosine similarity, Euclidean distance, or the Pearson correlation coefficient.
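The neighbor selection in 8.2) can be sketched with cosine similarity, one of the three measures named. The function names and the dictionary of per-application vectors are assumptions for the example; in the method described, the vectors would come from the topic model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_neighbors(target_vec, app_vecs, k):
    """Return the k (app_id, similarity) pairs most similar to target_vec.

    app_vecs maps each published application's id to its vector.
    """
    scored = [(app_id, cosine(target_vec, vec)) for app_id, vec in app_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Keeping the similarity alongside each neighbor id is convenient because the same values serve as weights in the subsequent scoring step.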

According to the above solution, step 9) uses the collaborative filtering method to compute a similarity-weighted sum and average, obtaining the content-based score list of the application to be recommended for the third-party libraries to be recommended, as follows:

From the call information of the content-based neighbor list N(a) of the application to be recommended with respect to a third-party library l to be recommended, the collaborative filtering method is applied according to the formula:

to calculate the score S₁(a, l), based on content and collaborative filtering, of the application a to be recommended for the third-party library l to be recommended. Scores for all third-party libraries in the list of third-party libraries to be recommended are calculated by the same formula, yielding the content-and-collaborative-filtering-based third-party library score list of the application to be recommended.
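One common reading of "similarity-weighted sum and average" is a weighted mean of the neighbors' 0/1 call indicators, normalized by the sum of the similarities. The sketch below implements that reading; the normalization and the function name `cf_score` are assumptions, since the formula itself is not reproduced in this text.

```python
def cf_score(neighbors, interactions, lib_id):
    """Similarity-weighted average of neighbors' calls to lib_id.

    neighbors: list of (app_id, similarity) pairs, i.e. N(a) with sim(a, a_i).
    interactions: dict mapping app_id to the set of library ids it calls.
    Assumed form: sum(sim * y) / sum(sim), with y in {0, 1}.
    """
    total_sim = sum(sim for _, sim in neighbors)
    if total_sim == 0:
        return 0.0
    weighted = sum(
        sim for app_id, sim in neighbors
        if lib_id in interactions.get(app_id, set())
    )
    return weighted / total_sim
```

Under this reading the score lies between 0 and 1 when similarities are non-negative, and a library called by the most similar neighbors scores highest.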

According to the above solution, step 11) calculates the similarity between the entity vector representation of the application to be recommended and the entity vector representation of the third-party library to be recommended, and uses it as the knowledge-graph-based score S₂(a, l) of the application to be recommended for the third-party library to be recommended, where the similarity is calculated by one of the following: cosine similarity, Euclidean distance, or the Pearson correlation coefficient.
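For the two measures other than cosine similarity, a sketch follows. Euclidean distance is a distance rather than a similarity, so it has to be converted before it can serve as a score; the mapping 1 / (1 + d) used here is an assumption, as are the function names.

```python
import math


def euclidean_similarity(u, v):
    """Map Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return 1.0 / (1.0 + d)


def pearson(u, v):
    """Pearson correlation of two equal-length vectors; 0.0 if one is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    std_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    std_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (std_u * std_v) if std_u and std_v else 0.0
```

Pearson correlation ranges over [-1, 1] while the converted Euclidean score ranges over (0, 1], so scores from different measures should not be mixed without rescaling.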

According to the above solution, the fusion in step 12) is performed by weighted fusion or feature fusion.
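Weighted fusion followed by the descending sort of step 13) can be sketched as below. The linear combination with a single coefficient `alpha`, the treatment of a missing score as 0, and the tie-break by library id are assumptions for the example; feature fusion is not shown.

```python
def fuse_and_rank(s1, s2, alpha=0.5, top_n=10):
    """Linearly fuse two score dicts and return the top_n libraries.

    s1: lib_id -> content-and-collaborative-filtering score S1(a, l).
    s2: lib_id -> knowledge-graph score S2(a, l).
    alpha: weight on s1 (assumed form: alpha * S1 + (1 - alpha) * S2).
    """
    libs = set(s1) | set(s2)
    fused = {
        lib: alpha * s1.get(lib, 0.0) + (1 - alpha) * s2.get(lib, 0.0)
        for lib in libs
    }
    # Sort by fused score descending; ties are broken by library id.
    ranked = sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

With `alpha=0.5` both score lists contribute equally; because the two scores may be on different scales, normalizing them before fusion is a reasonable precaution.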

The beneficial effects of the present invention are:

1. The present invention provides a hybrid third-party library recommendation method that avoids the defects of any single recommendation method; it effectively addresses the data sparsity and cold-start problems while improving recommendation accuracy and other metrics.

2. The hybrid recommendation algorithm makes full use of information about applications and third-party libraries along different dimensions, so it can uncover third-party libraries with potential calling relationships and avoid confining the recommendation results to a narrow range.

3. With the knowledge graph introduced, recommendation results can be explained using the topological structure of the knowledge graph.

Description of Drawings

The present invention is further described below with reference to the accompanying drawings and embodiments, in which:

Fig. 1 is a flow chart of the method according to an embodiment of the present invention;

Fig. 2 is a schematic comparison of experimental results according to an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.

Some current application recommendation methods suffer from problems such as data sparsity and cold start, and they cannot recommend third-party libraries that are not directly associated with the application, which contributes to a long-tail effect. Taking the existing content-based and collaborative-filtering-based recommendation methods into account, the present invention provides a third-party library recommendation method that combines three recommendation approaches, with the knowledge-graph-based method as the primary one and the content-based and collaborative filtering methods as auxiliary ones.

As shown in Fig. 1, the third-party library recommendation method based on hybrid collaborative filtering provided by the present invention comprises the following steps:

Step 1: Obtain the data of published applications and third-party libraries from the application and third-party library servers. The data fall into three parts: unstructured text description information of the applications and third-party libraries, which describes their specific functions, usage scenarios, and usage instructions; call information of the applications to the third-party libraries, which records the list of third-party libraries called by each published application; and structured semantic information of the applications and third-party libraries themselves, such as an application's type, developer, price, and installation package size, and a third-party library's type, tags, and version;

Step 2: Clean and preprocess the unstructured text description information of the applications and third-party libraries obtained in Step 1 with natural language processing methods. The processing includes word segmentation, removal of stop words, punctuation marks, and low-frequency words, and stemming. Then vectorize the preprocessed data with a text vectorization method such as TF-IDF, Doc2Bow, or LDA to obtain the "feature training data set of applications and third-party libraries";

Step 3: Train on the "feature training data set of applications and third-party libraries" with an unsupervised learning method to obtain a topic model of the applications and third-party libraries, used for subsequent document similarity comparison;

Step 4: Group the call information of applications to third-party libraries obtained in Step 1 by application ID and generate a preference vector for each application. The preference relationship is defined as follows: for an application a and a third-party library l, y_al = 1 if a calls l, and y_al = 0 otherwise. The preference vectors of M applications over N third-party libraries constitute the call interaction matrix of applications to third-party libraries;

Step 5: Use the structured semantic information of the applications and third-party libraries obtained in Step 1 to construct a knowledge graph for the domain. The construction process includes entity recognition, extraction of entity-relation triples, entity disambiguation, and knowledge completion. The extracted triples are stored in a graph database to obtain a usable knowledge graph;

Step 6: Use a knowledge graph representation learning method, such as TransE, TransH, or TransR, to map the knowledge graph to a low-dimensional vector space, so that every entity and relation in the knowledge graph is represented as a vector;

Step 7: Obtain the unstructured text description information of the application to be recommended and its own structured information from the application and third-party library servers;

Step 8: Preprocess the unstructured text description information of the application to be recommended with natural language processing methods, convert it into a vector with a text vectorization method, and input it into the topic model obtained in Step 3. Through similarity comparison, take the k applications with the highest similarity as the content-based neighbor list of the application to be recommended, N(a) = {a₁, a₂, …, a_k}. The similarity sim(a, a_i) can be measured in several ways, such as cosine similarity, Euclidean distance, or the Pearson correlation coefficient;

Step 9: From the call information of the content-based neighbor list N(a) of the application to be recommended with respect to a third-party library l to be recommended, apply the collaborative filtering method according to the formula:

to calculate the score S₁(a, l), based on content and collaborative filtering, of the application a to be recommended for the third-party library l to be recommended. Calculate the scores for all third-party libraries in the list of third-party libraries to be recommended in this way to obtain the content-and-collaborative-filtering-based score list of the application to be recommended;

Step 10: Input the structured information of the application to be recommended obtained in Step 7 and the list of third-party libraries to be recommended into the knowledge graph, and find the matching entities through an entity recognition method, to obtain the corresponding list of entity vectors generated by the knowledge graph representation learning method;

Step 11: Calculate the similarity between the entity vector representation of the application to be recommended and the entity vector representation of the third-party library to be recommended, as the knowledge-graph-based score S₂(a, l) of the application to be recommended for the third-party library to be recommended. Here, too, several similarity measures are available, and those skilled in the art may also use other existing knowledge-graph-based recommendation methods to calculate this score;

Step 12: Fuse the content-and-collaborative-filtering-based score list obtained in Step 9 with the knowledge-graph-based score list obtained in Step 11 to obtain the hybrid-recommendation-based score list of the application to be recommended for the third-party libraries to be recommended. Possible fusion methods include weighted fusion and feature fusion;

Step 13: Sort the scores of the application to be recommended for all third-party libraries to be recommended in descending order to generate the Top-N recommendation list based on hybrid recommendation.
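Step 6 names TransE as one possible representation learning method. Its core idea is that, for a true triple (head, relation, tail), the embedding of the head plus the embedding of the relation should lie close to the embedding of the tail. The sketch below shows only the distance function and the margin ranking loss for a single pair of triples; the function names, the choice of the L2 norm, and the margin value are assumptions, and a complete trainer (negative sampling, gradient updates, normalization) is omitted.

```python
import math


def transe_distance(h, r, t):
    """L2 norm of h + r - t; a smaller value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive, negative, margin=1.0):
    """Margin ranking loss for one true triple and one corrupted triple.

    positive, negative: (h, r, t) tuples of embedding vectors.
    The loss is zero once the true triple is closer than the corrupted
    one by at least the margin.
    """
    return max(0.0, margin + transe_distance(*positive) - transe_distance(*negative))
```

After training, the entity vectors looked up in Step 10 are the learned embeddings, and the similarity in Step 11 is computed between them.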

Notes on the results:

In this embodiment, Step 2 and Step 8 use the LDA text vectorization method to vectorize the preprocessed data, and Step 12 uses weighted fusion to fuse the content-and-collaborative-filtering-based score list obtained in Step 9 with the knowledge-graph-based score list obtained in Step 11, with the fusion coefficient set to 0.5.

The method of this embodiment is compared with existing methods; the comparison results are shown in Fig. 2.

Comparative experimental setup: the experimental data set contains 5274 applications and information on the 471 third-party libraries they call. The compared methods are collaborative filtering (CF), the LDA topic model (LDA), neural collaborative filtering (NCF), a knowledge-graph-based method (RippleNet), and a hybrid method based on collaborative filtering and topic models (AppLibRec). The evaluation metrics are precision (Precision@N), recall (Recall@N), F1 score (F1@N), mean average precision (MAP@N), and normalized discounted cumulative gain (NDCG@N).

The experimental results show that the method of this embodiment (TM-MKR) outperforms all of the compared existing methods on every evaluation metric.

It should be understood that those of ordinary skill in the art may make improvements or changes based on the above description, and all such improvements and changes fall within the protection scope of the appended claims of the present invention.
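For reference, four of the listed metrics can be computed for a single application as sketched below, using the standard definitions with binary relevance. The function names are assumptions; MAP@N is omitted for brevity, and the reported figures would be averages of these per-application values over the test set.

```python
import math


def precision_recall_f1_at_n(ranked, relevant, n):
    """Precision@N, Recall@N and F1@N for one ranked recommendation list.

    ranked: recommended library ids, best first.
    relevant: set of library ids the application actually calls.
    """
    hits = sum(1 for lib in ranked[:n] if lib in relevant)
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def ndcg_at_n(ranked, relevant, n):
    """NDCG@N with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, lib in enumerate(ranked[:n])
        if lib in relevant
    )
    # Ideal ranking places all relevant libraries at the top positions.
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```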

Claims

1. A third-party library recommendation method based on hybrid collaborative filtering, characterized by comprising the following steps:

1) obtaining published application and third-party library data from the application and third-party library servers, the data including text description information of the applications and third-party libraries, call information of the applications to the third-party libraries, and structured semantic information of the applications and third-party libraries themselves;

2) preprocessing and vectorizing the text description information of the applications and third-party libraries from step 1) with natural language processing methods, to obtain a content-based feature training data set of the applications and third-party libraries;

3) using the content-based feature training data set from step 2) as a corpus, and training with an unsupervised learning method to obtain a topic model of the applications and third-party libraries;

4) constructing, from the call information in step 1), a call interaction matrix of applications to third-party libraries according to the calling relationship of each application to each third-party library;

5) extracting entities and the relationships between entities from the structured semantic information of the applications and third-party libraries obtained in step 1), and saving them in a graph database to form a knowledge graph;

6) using a knowledge graph representation learning method to map the knowledge graph obtained in step 5) to a low-dimensional space, to obtain a vectorized representation of each entity and relationship;

7) obtaining the text description information of the application to be recommended and its own structured information;

8) inputting the text description information of the application to be recommended obtained in step 7) into the topic model obtained in step 3), and obtaining a content-based neighbor list of the application to be recommended by comparing the similarity of the text description information;

9) from the call interaction information of the applications in the content-based neighbor list with respect to the third-party libraries to be recommended, using a collaborative filtering method to compute a similarity-weighted sum and average, to obtain a content-based score list of the application to be recommended for the third-party libraries to be recommended;

10) inputting the structured information of the application to be recommended obtained in step 7) and the list of third-party libraries to be recommended into the knowledge graph, and finding the matching entities through an entity recognition method, to obtain the corresponding list of entity vectors generated by the knowledge graph representation learning method;

11) calculating the similarity between the entity vector representation of the application to be recommended and the entity vector representation of each third-party library to be recommended, as a knowledge-graph-based score list of the application to be recommended for the third-party libraries to be recommended;

12) fusing the content-and-collaborative-filtering-based score list obtained in step 9) with the knowledge-graph-based score list obtained in step 11), to obtain a hybrid-recommendation-based score list of the application to be recommended for the third-party libraries to be recommended;

13) sorting the scores of the application to be recommended for all third-party libraries to be recommended in descending order, to obtain a Top-N third-party library recommendation list for the application to be recommended based on hybrid recommendation.

2. The third-party library recommendation method based on hybrid collaborative filtering according to claim 1, characterized in that the preprocessing by natural language processing methods in step 2) includes word segmentation, removal of stop words, punctuation marks, and low-frequency words, and stemming.

3. The third-party library recommendation method based on hybrid collaborative filtering according to claim 1, characterized in that the vectorization in step 2) uses a text vectorization method such as TF-IDF, Doc2Bow, or LDA.

4. The third-party library recommendation method based on hybrid collaborative filtering according to claim 1, characterized in that the call interaction matrix of applications to third-party libraries in step 4) is constructed as follows:

5. The third-party library recommendation method based on hybrid collaborative filtering according to claim 1, characterized in that the content-based neighbor list of the application to be recommended in step 8) is obtained as follows:

8.1) the text description information of the application to be recommended is preprocessed by natural language processing methods and then converted into a vector by a text vectorization method;

8.2) the resulting vector is input into the topic model obtained in step 3), and through similarity comparison the k applications with the highest similarity are taken as the content-based neighbor list of the application to be recommended, N(a) = {a₁, a₂, …, a_k}, where the similarity sim(a, a_i) is calculated by one of the following: cosine similarity, Euclidean distance, or the Pearson correlation coefficient.

6. The third-party library recommendation method based on hybrid collaborative filtering according to claim 5, characterized in that step 9) uses the collaborative filtering method to compute a similarity-weighted sum and average, obtaining the content-based score list of the application to be recommended for the third-party libraries to be recommended, as follows:

From the call information of the content-based neighbor list N(a) of the application to be recommended with respect to a third-party library l to be recommended, the collaborative filtering method is applied according to the formula:

7. The third-party library recommendation method based on hybrid collaborative filtering according to claim 1, characterized in that step 11) calculates the similarity between the entity vector representation of the application to be recommended and the entity vector representation of the third-party library to be recommended, as the knowledge-graph-based score S₂(a, l) of the application to be recommended for the third-party library to be recommended, where the similarity is calculated by one of the following: cosine similarity, Euclidean distance, or the Pearson correlation coefficient.

8. The third-party library recommendation method based on hybrid collaborative filtering according to claim 1, characterized in that the fusion in step 12) is performed by weighted fusion or feature fusion.