CN106815297A - A kind of academic resources recommendation service system and method - Google Patents
- Publication number
- CN106815297A CN106815297A CN201611130297.9A CN201611130297A CN106815297A CN 106815297 A CN106815297 A CN 106815297A CN 201611130297 A CN201611130297 A CN 201611130297A CN 106815297 A CN106815297 A CN 106815297A
- Authority
- CN
- China
- Prior art keywords
- academic
- resource
- model
- topic
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An academic resource recommendation service system and method are provided. An LDA-based topic crawler crawls academic resources from the Internet, and an LDA-based text classification model classifies them into A predetermined categories before they are stored in a local academic resource database. The system further comprises an academic resource model, a resource quality value calculation model, and a user interest model, and a tracking software module is embedded in the user's terminal. Combining the user's subjects of interest with historical browsing behavior data, both the academic resource model and the user interest model are built along four dimensions: academic resource type, subject distribution, keyword distribution, and LDA latent topic distribution. The similarity between the academic resource model and the user interest preference model is computed and combined with the resource quality value to obtain a recommendation degree, according to which Top-N academic resource recommendations are made for the user. The invention delivers personalized, accurate recommendation of academic resources based on user identity, interests, and browsing behavior, improving the working efficiency of researchers.
Description
Technical Field
The present invention relates to the field of computer application technology, and in particular to an academic resource recommendation service system and a method of using the system to provide academic resource recommendation services to relevant users.
Background Art
We have entered the era of big data, and this is especially true in the field of academic resources, where hundreds of millions of academic resources of various kinds are produced every year. Beyond academic papers and patents, large volumes of academic conferences, academic news, and academic community information emerge in real time; these types of academic resources matter greatly for users who want to grasp, accurately and efficiently, the state of research in their fields of interest. However, research users already carry heavy scientific workloads, while such academic resources are heterogeneous in content and structure and grow rapidly. Traditional search engines struggle to retrieve them completely and precisely, the search process is cumbersome, and users often spend substantial time and energy querying for academic resources of interest, which hurts their working efficiency.
Current research on personalized recommendation of academic resources focuses mainly on academic papers, so the recommended resource types are narrow. Different user groups, i.e. users of different identities, pay different levels of attention to different types of academic resources, but current personalized-recommendation research does not consider these factors and cannot formulate multi-strategy recommendation schemes based on user identity. Moreover, current academic resource recommendation research is confined to the recommendation module, whereas the present invention provides a systematized service for academic resource recommendation, ranging from the dynamic acquisition, integration, and classification of academic resources to personalized recommendation based on user identity, behavior, and subjects of interest, forming an integrated service system centered on resource integration and recommendation.
LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also described as a three-layer Bayesian probability model with a word-topic-document structure. "Generative" means that each word of an article is assumed to be produced by the process "choose a topic with some probability, then choose a word from that topic with some probability." A topic refers to a defined professional or interest domain, such as aerospace, biomedicine, or information technology; concretely, it is a set of related words. Documents follow a multinomial distribution over topics, and topics follow a multinomial distribution over words. LDA is an unsupervised machine learning technique that can identify latent topic information in documents. It adopts a bag-of-words approach, treating each document as a word-frequency vector and thereby converting text into numerical information that is easy to model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over many words. The LDA topic model is the canonical model for topic mining in natural language processing: it extracts latent topics from a text corpus and offers a quantitative way to study research topics, and it has been widely applied to topic discovery in academic resources, such as research hotspot mining, research topic evolution, and research trend prediction.
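As a concrete illustration of the generative view described above, the following is a minimal collapsed Gibbs sampler for LDA in pure Python. This is a didactic sketch, not the patent's implementation; a real system would use an optimized library, and the hyperparameters here are arbitrary defaults:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns theta (per-document topic distributions) and phi
    (per-topic word distributions)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * K for _ in docs]                 # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]    # topic-word counts
    nk = [0] * K                                  # tokens per topic
    z = []                                        # topic assignment per token
    for di, d in enumerate(docs):                 # random initialization
        zs = []
        for w in d:
            t = rng.randrange(K)
            zs.append(t); ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                        # resample each token's topic
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    theta = [[(ndk[di][k] + alpha) / (len(d) + K * alpha) for k in range(K)]
             for di, d in enumerate(docs)]
    phi = [{w: (nkw[k][w] + beta) / (nk[k] + V * beta) for w in vocab}
           for k in range(K)]
    return theta, phi
```

The topic-associated word sets used by the crawler correspond to taking the highest-probability words of each row of `phi`.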
In addition, with the development of the Internet, it has become filled with vast amounts of information text in many forms, such as news, blogs, and meeting minutes. These texts contain academic-related content to varying degrees, often including the latest research information of concern to people in related disciplines. Yet such texts are messy and disordered, their disciplines often overlap, and they generally carry no classification information of their own, so existing technology struggles to classify them correctly and automatically. It is likewise difficult for people in related disciplines to retrieve them completely and precisely with traditional search engines, the search process is cumbersome, and users often spend substantial time and energy querying for academic resources of interest, which hurts their working efficiency.
The present invention is intended to solve the above technical problems.
Summary of the Invention
The technical problem to be solved by the present invention is, in view of the state of the art described above, to provide an academic resource recommendation service system and a method of using the system to provide academic resource recommendation services to relevant users.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
An academic resource recommendation service system, wherein the academic resources are electronic texts of various kinds published on the Internet. The system comprises a web crawler, a text classification model, and an academic resource database, with the web crawler crawling academic resources on the Internet. It is characterized in that the text classification model classifies the crawled resources into A predetermined categories before they are stored in the local academic resource database, and an open API of the database is provided for the display and resource recommendation modules to call. The system further comprises an academic resource model, a resource quality value calculation model, and a user interest model, and a tracking software module is embedded in the user's terminal to track and record the user's online browsing behavior. Based on the historical browsing behavior data of different user groups, the degree of attention that users of different identities pay to each type of academic resource is computed. Academic resources are modeled along four dimensions: resource type, subject distribution, keyword distribution, and LDA latent topic distribution. Combining the user's subjects of interest with historical browsing behavior data, the user's interest preferences are modeled; the similarity between the academic resource model and the user interest preference model is computed and combined with the resource quality value to obtain a recommendation degree, according to which Top-N academic resource recommendations are made for the user.
The web crawler is a topic crawler, and an LDA topic model is further included; the LDA topic model is a three-layer Bayesian generative "document-topic-word" model. A corpus including training corpora is configured for the LDA topic model in advance. The LDA topic model is trained on the training corpora with a set number of topics K, and the word-clustering effect of training yields K sets of topic-associated words, i.e. the K topic documents for the crawler's current crawl. On top of an ordinary web crawler, the topic crawler further comprises a topic determination module, a similarity calculation module, and a URL priority-ordering module. The topic crawler consists of multiple distributed crawlers, one per academic topic, so that academic resources of multiple academic topics are obtained simultaneously. In each crawl, the topic determination module fixes the target topic and its topic document, and the topic document guides the topic-similarity computation: the similarity calculation module computes and judges the topic similarity of each anchor text on a crawled page in combination with the page content, discards hyperlinks whose anchor-text-plus-page similarity falls below a set threshold, and selects URLs whose similarity exceeds the threshold. The topic crawler maintains a queue of URLs of unvisited pages pointed to by the hyperlinks of visited pages, sorted in descending order of similarity; it visits the pages in queue order, crawls the corresponding academic resources, and continually stores the crawled resources in the database with classification labels, for the current crawl's topic documents, until the unvisited URL queue is empty. The academic resources crawled in each run serve as new corpora for training the LDA topic model, and the crawling process is repeated so that the topic-associated words of each topic document are continually supplemented and updated and the crawled academic resources are continually enriched to a humanly acceptable degree.
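The frontier management described above — threshold filtering plus a similarity-ordered URL queue — can be sketched as follows. Fetching and parsing are omitted, and the class name `TopicFrontier`, the token-count cosine measure, and the threshold value are illustrative assumptions, not the patent's exact design:

```python
import heapq
import itertools
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TopicFrontier:
    """URL queue ordered by topic similarity. Links whose anchor-text-plus-page
    similarity to the topic document falls below `threshold` are discarded;
    pop() returns the most similar unvisited URL."""
    def __init__(self, topic_words, threshold=0.1):
        self.topic = Counter(topic_words)
        self.threshold = threshold
        self.heap = []
        self.counter = itertools.count()  # tie-breaker for equal scores

    def offer(self, url, anchor_tokens, page_tokens):
        score = cosine(Counter(anchor_tokens + page_tokens), self.topic)
        if score >= self.threshold:
            heapq.heappush(self.heap, (-score, next(self.counter), url))

    def pop(self):
        if not self.heap:
            return None
        _, _, url = heapq.heappop(self.heap)
        return url
```

In a full crawler, each popped URL would be fetched, its resources stored with classification labels, and its outgoing links offered back to the frontier until the queue empties.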
The corpus further includes verification corpora with definite categories, used in advance to have the text classification model perform classification verification over the A predetermined categories, so as to obtain the model's classification accuracy for each of the A categories as a credibility indicator for its assignment to that category. This accuracy is the ratio of correctly classified texts among all verification texts the model assigns to a given category, and a classification accuracy threshold is preset.
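The per-category accuracy described here — the fraction of validation texts assigned to a category that truly belong to it — can be computed as in this sketch (the function name is illustrative):

```python
from collections import defaultdict

def per_category_accuracy(predictions, gold):
    """For each predicted category, the fraction of validation texts
    assigned to it whose true label matches: the per-category
    'classification accuracy' used as the credibility indicator."""
    assigned = defaultdict(int)
    correct = defaultdict(int)
    for pred, true in zip(predictions, gold):
        assigned[pred] += 1
        if pred == true:
            correct[pred] += 1
    return {c: correct[c] / assigned[c] for c in assigned}
```

Categories whose accuracy falls below the preset threshold are the ones later routed through the LDA-based feature expansion step.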
All subjects are divided into 75 subject categories, i.e. the number of categories A is 75; the number of topics K for LDA topic model training is set to 100; and the preset classification accuracy threshold for classification verification by the text classification model is 80%.
A method of using the resource recommendation service system to provide academic resource recommendation services to relevant users, wherein the academic resources are electronic texts of various kinds published on the Internet, comprising crawling academic resources on the Internet with a web crawler, characterized in that the text classification model classifies the crawled academic resources into A predetermined categories before storage, forming an academic resource database, and an open API of the database is provided for the display and resource recommendation modules to call. A resource quality value calculation model and a user interest model are used, and a tracking software module is embedded in the user's terminal to track and record the user's online browsing behavior. Recommending academic resources to a user proceeds in a cold-start recommendation stage and a secondary recommendation stage. The cold-start stage recommends, based on the user's subjects of interest, high-quality resources matching those subjects; high-quality resources are those whose resource quality values, as computed and compared by the resource quality value calculation model, are high, the quality value being the arithmetic or weighted mean of resource authority, resource community popularity, and resource recentness. In the secondary recommendation stage, the user interest model and the resource model are built separately, the similarity between the two is computed and combined with the resource quality value to obtain a recommendation degree, and finally Top-N academic resource recommendations are made for the user according to the recommendation degree.
Computation of the resource quality value Quality includes the resource authority Authority, whose calculation formula is as follows:
where Level is the quantified score of the level of the publication in which the resource appeared. Publication levels are divided into 5 grades, scored 1, 0.8, 0.6, 0.4, and 0.2 respectively: top journals or conferences such as Nature and Science score 1, second-tier venues such as ACM Transactions score 0.8, and the lowest tier scores 0.2. The calculation formula of Cite is as follows:
Cite = Cites / maxCite (2)
Cite is the quantified citation result of the resource, Cites is the resource's citation count, and maxCite is the largest citation count in the resource's source database;
The calculation formula of the resource community popularity Popularity is as follows:
Popularity = readTimes / maxReadTimes (3)
readTimes is the number of times the paper has been read, and maxReadTimes is the largest read count in the resource's source database;
The recentness Recentness of a resource is computed in the same normalized manner, with the following formula:
year and month are the resource's publication year and month respectively; minYear, minMonth, maxYear, and maxMonth are the earliest and latest publication years and months among all resources of this type in the source database;
The resource quality value Quality is computed as the arithmetic or weighted mean of Authority, Popularity, and Recentness, per the definition above:
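Since formulas (1), (4), and (5) are not reproduced in this text, the following sketch implements one plausible reading: Authority as the mean of the Level score and the normalized citation score, Recentness as a linear normalization over elapsed months, and Quality as the arithmetic or weighted mean stated in the method description. Apart from formulas (2) and (3), these exact forms are assumptions:

```python
def authority(level_score, cites, max_cite):
    # Formula (1) is not reproduced in the source; assumed here to average
    # the publication-level score and the normalized citation count.
    cite = cites / max_cite if max_cite else 0.0  # formula (2)
    return (level_score + cite) / 2

def popularity(read_times, max_read_times):
    # Formula (3): read count normalized by the source database's maximum.
    return read_times / max_read_times if max_read_times else 0.0

def recentness(year, month, min_year, min_month, max_year, max_month):
    # Formula (4) is not reproduced; assumed linear over elapsed months.
    span = (max_year - min_year) * 12 + (max_month - min_month)
    pos = (year - min_year) * 12 + (month - min_month)
    return pos / span if span else 1.0

def quality(auth, pop, recent, weights=None):
    # Formula (5): arithmetic (or weighted) mean, per the method description.
    if weights is None:
        return (auth + pop + recent) / 3
    w1, w2, w3 = weights
    return w1 * auth + w2 * pop + w3 * recent
```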
The academic resource model is expressed as follows:
M_r = {T_r, K_r, C_t, L_r} (6)
where T_r is the subject distribution vector of the academic resource, i.e. the probability values of the resource over the A subject categories, obtained from the Bayesian multinomial model;
K_r = {(k_r1, ω_r1), (k_r2, ω_r2), …, (k_rm, ω_rm)}, where m is the number of keywords, k_ri (1 ≤ i ≤ m) denotes the i-th keyword of a single academic resource, and ω_ri, the weight of keyword k_ri, is obtained by the improved tf-idf algorithm with the following formula:
w(i,r) denotes the weight of the i-th keyword in document r, tf(i,r) the frequency of the i-th keyword in document r, Z the total number of documents in the document set, and L the number of documents containing keyword i. L_r is the latent topic distribution vector, L_r = {l_r1, l_r2, l_r3, …, l_rN1}, where N1 is the number of latent topics. C_t is the resource type; t may take the values 1, 2, 3, 4, 5, i.e. the five major categories of academic resources: papers, patents, news, conferences, and books;
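The keyword weighting can be sketched with the classic tf-idf form that the description paraphrases (term frequency in the document times the log of total documents over documents containing the term). The patent's exact "improved" variant of formula (7) is not reproduced in the source, so this is only the baseline form it builds on:

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus):
    """w(i,r) = tf(i,r) * log(Z / L): tf(i,r) is term i's relative frequency
    in document r, Z the number of documents in the set, L the number of
    documents containing the term. (Baseline tf-idf; the patent's 'improved'
    variant is not reproduced here.)"""
    Z = len(corpus)
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    weights = {}
    for term, count in tf.items():
        L = sum(1 for d in corpus if term in d)
        weights[term] = (count / n) * math.log(Z / L)
    return weights
```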
According to the behavioral characteristics of users of mobile software, a user's operations on an academic resource are divided into opening, reading, star rating, sharing, and favoriting. The user interest model is built from the user's background and the academic resources the user has browsed, according to the user's different browsing behaviors and in combination with the academic resource model. The user interest model is expressed as follows:
M_u = {T_u, K_u, C_t, L_u} (8)
where T_u is the user's subject preference distribution vector, formed from the subject distribution vectors T_r of the academic resources of a given type that the user has browsed over a period of time, weighted by the user's behavior, i.e. the behavior-weighted combination of the T_jr normalized over the sum resources acted on (formula (9)),
where sum is the total number of academic resources the user has acted on, and s_j is the "behavior coefficient" of the user's actions on academic resource j; the larger the value, the more the user likes the resource. T_jr denotes the subject distribution vector of the j-th resource. The computation of s_j jointly considers opening, reading, rating, favoriting, and sharing, and can accurately reflect the user's degree of preference for a resource.
K_u = {(k_u1, ω_u1), (k_u2, ω_u2), …, (k_uN2, ω_uN2)} is the user's keyword preference distribution vector, where N2 is the number of keywords and k_ui (1 ≤ i ≤ N2) denotes the i-th user preference keyword; ω_ui, the weight of keyword k_ui, is computed from the keyword distribution vectors K_r of the academic resources of a given type on which user u has acted over a period of time:
K_jr′ = s_j · K_jr (10)
By formula (10), a new keyword distribution vector is computed for each resource, and the Top-N2 entries across all resources' new keyword distribution vectors are selected as the user keyword preference distribution vector K_u;
L_u is the user's LDA latent topic preference distribution vector, computed from the resources' LDA latent topic distribution vectors L_r = {l_r1, l_r2, l_r3, …, l_rN1} in the same way as T_u:
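The user-side vectors above — the behavior-weighted T_u and L_u of formulas (9) and (11), and the Top-N keyword preferences K_u via formula (10) — can be sketched as follows, assuming the normalization by the count of resources acted on that the text describes:

```python
def preference_vector(resource_vectors, behavior_coeffs):
    """Formulas (9)/(11): behavior-weighted average of per-resource
    distribution vectors (subject distributions T_jr or latent-topic
    distributions L_jr), normalized by the number of resources acted on
    (the 'sum' in the text). Assumed form, since the formula image is
    not reproduced in the source."""
    total = len(resource_vectors)
    dim = len(resource_vectors[0])
    out = [0.0] * dim
    for vec, s in zip(resource_vectors, behavior_coeffs):
        for i, v in enumerate(vec):
            out[i] += s * v
    return [x / total for x in out]

def keyword_preferences(resource_keywords, behavior_coeffs, top_n):
    """Formula (10): rescale each resource's keyword weights by its
    behavior coefficient s_j, merge, and keep the Top-N keywords as K_u."""
    merged = {}
    for kw_weights, s in zip(resource_keywords, behavior_coeffs):
        for k, w in kw_weights.items():
            merged[k] = merged.get(k, 0.0) + s * w
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```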
The similarity between the user interest model and the resource model is computed as follows:
The academic resource model is expressed as:
M_r = {T_r, K_r, C_t, L_r} (12)
The user interest model is expressed as:
M_u = {T_u, K_u, C_t, L_u} (13)
The similarity between the user subject preference distribution vector T_u and the academic resource subject distribution vector T_r is computed by cosine similarity, i.e. Sim_T(T_u, T_r) = (T_u · T_r) / (‖T_u‖ · ‖T_r‖) (14).
The similarity between the user LDA latent topic preference distribution vector L_u and the academic resource LDA latent topic distribution vector L_r is likewise computed by cosine similarity, i.e. Sim_L(L_u, L_r) = (L_u · L_r) / (‖L_u‖ · ‖L_r‖) (15).
The similarity between the user keyword preference distribution vector K_u and the academic resource keyword distribution vector K_r is computed via Jaccard similarity over the keyword sets, i.e. Sim_K(K_u, K_r) = |K_u ∩ K_r| / |K_u ∪ K_r| (16).
The similarity between the user interest model and the academic resource model is then Sim(M_u, M_r) = σ·Sim_T(T_u, T_r) + ρ·Sim_L(L_u, L_r) + τ·Sim_K(K_u, K_r) (17),
where σ + ρ + τ = 1, and the specific weight allocation is obtained by experimental training.
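The three similarity components and their weighted combination can be sketched directly from the definitions above (cosine for the subject and latent-topic vectors, Jaccard over keyword sets, weights σ, ρ, τ summing to 1; the sample weight values below are illustrative, since the text leaves them to training):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(keys_a, keys_b):
    """Jaccard similarity over the two keyword sets."""
    a, b = set(keys_a), set(keys_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def model_similarity(Tu, Tr, Lu, Lr, Ku, Kr, sigma, rho, tau):
    """Formula (17): Sim(M_u, M_r) = sigma*cos(T_u,T_r) + rho*cos(L_u,L_r)
    + tau*Jaccard(K_u,K_r), with sigma + rho + tau = 1."""
    assert abs(sigma + rho + tau - 1.0) < 1e-9
    return (sigma * cosine_sim(Tu, Tr)
            + rho * cosine_sim(Lu, Lr)
            + tau * jaccard_sim(Ku, Kr))
```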
The concept of recommendation degree (Recommendation_degree) is introduced: the greater an academic resource's recommendation degree, the better the resource matches the user's interest preferences and the higher its quality. The recommendation degree is computed as follows:
Recommendation_degree = λ_1·Sim(M_u, M_r) + λ_2·Quality, where λ_1 + λ_2 = 1 (18)
The secondary recommendation stage then makes Top-N recommendations according to the recommendation degree of the academic resources.
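The recommendation degree of formula (18) and the final Top-N selection can be sketched as follows; the value λ_1 = 0.7 is only an illustrative choice, since the text leaves the weights unspecified:

```python
def top_n_recommend(resource_ids, user_sims, qualities, lam1=0.7, n=10):
    """Formula (18): Recommendation_degree = lam1*Sim(M_u,M_r) + lam2*Quality,
    with lam1 + lam2 = 1 (lam1 = 0.7 is an illustrative default).
    Returns the Top-N resource ids by recommendation degree."""
    lam2 = 1.0 - lam1
    scored = [(lam1 * s + lam2 * q, rid)
              for rid, s, q in zip(resource_ids, user_sims, qualities)]
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:n]]
```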
The web crawler includes an addressing crawler and a topic crawler, and an LDA topic model is further included; the LDA topic model is a three-layer Bayesian generative "document-topic-word" model. A corpus including training corpora is configured for the LDA topic model in advance. The LDA topic model is trained on the training corpora with a set number of topics K, and the word-clustering effect of training yields K sets of topic-associated words, i.e. the K topic documents for the crawler's current crawl. On top of an ordinary web crawler, the topic crawler further comprises a topic determination module, a similarity calculation module, and a URL priority-ordering module. The topic crawler consists of multiple distributed crawlers, one per academic topic, so that academic resources of multiple academic topics are obtained simultaneously. In each crawl, the topic determination module fixes the target topic and its topic document, and the topic document guides the topic-similarity computation: the similarity calculation module computes and judges the topic similarity of each anchor text on a crawled page in combination with the page content, discards hyperlinks whose anchor-text-plus-page similarity falls below a set threshold, and selects URLs whose similarity exceeds the threshold. The topic crawler maintains a queue of URLs of unvisited pages pointed to by the hyperlinks of visited pages, sorted in descending order of similarity; it visits the pages in queue order, crawls the corresponding academic resources, and continually stores the crawled resources in the database with classification labels, for the current crawl's topic documents, until the unvisited URL queue is empty. The academic resources crawled in each run serve as new corpora for training the LDA topic model, and the crawling process is repeated so that the topic-associated words of each topic document are continually supplemented and updated and the crawled academic resources are continually enriched to a humanly acceptable degree.
The corpus further includes verification corpora with definite categories, used in advance to have the text classification model perform classification verification over the A predetermined categories, so as to obtain the model's classification accuracy for each of the A categories as a credibility indicator for its assignment to that category; this accuracy is the ratio of correctly classified texts among all verification texts the model assigns to a given category, and a classification accuracy threshold is preset. Classifying each text to be classified with the text classification model specifically comprises the following steps:
Step 1: Preprocess each text to be classified; preprocessing includes word segmentation and stop-word removal while retaining proper nouns. Compute the feature weight of every preprocessed word of the text: a word's feature weight is proportional to its number of occurrences in the text and inversely proportional to its number of occurrences in the training corpora. Sort the resulting word set in descending order of feature weight, and take the front portion of each text's original word set as its feature word set.
Step 2: Using the text classification model, take each text's original feature word set and compute the probability that the text belongs to each of the A predetermined categories; select the category with the largest probability as the text's classification category.
Step 3: Examine the classification result of Step 2. If the model's classification accuracy for the selected category reaches the preset threshold, output the result directly; if it does not, proceed to Step 4.
Step 4: Feed each preprocessed text into the LDA topic model and compute the text's weight for each of the K configured topics. Select the topic with the largest weight and add the top Y topic-associated words of that topic, obtained from prior LDA topic-model training, to the text's original feature word set to form an expanded feature word set. Using the text classification model again, compute the probability that the text belongs to each of the predetermined A categories, and take the category with the highest probability as the text's final classification.
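The four steps above can be sketched roughly as follows. The toy overlap-based classifier, the fixed dominant-topic function, the per-category accuracy table, and the threshold are all invented stand-ins for the trained naive Bayes and LDA models described herein; Step 1's preprocessing is assumed already done.

```python
# A minimal, self-contained sketch of the selective feature-expansion flow
# (Steps 2-4). All models and numbers below are toy stand-ins.

def predict(words, category_vocab):
    """Toy classifier: score each category by vocabulary overlap."""
    scores = {c: len(set(words) & vocab) for c, vocab in category_vocab.items()}
    return max(scores, key=scores.get)

def classify_with_expansion(words, category_vocab, topic_words,
                            dominant_topic, accuracy, threshold, top_y=3):
    category = predict(words, category_vocab)           # Step 2: first pass
    if accuracy[category] >= threshold:                 # Step 3: reliable category
        return category
    topic = dominant_topic(words)                       # Step 4: heaviest LDA topic
    expanded = list(words) + topic_words[topic][:top_y]  # add top-Y topic words
    return predict(expanded, category_vocab)            # re-classify, expanded set

category_vocab = {
    "computer science": {"algorithm", "network", "model"},
    "medicine": {"clinical", "patient", "therapy"},
}
topic_words = {"ml": ["model", "algorithm", "training"]}
accuracy = {"computer science": 0.9, "medicine": 0.6}

# A text whose first-pass category is reliable is returned directly.
print(classify_with_expansion(
    ["algorithm", "network"], category_vocab, topic_words,
    lambda w: "ml", accuracy, threshold=0.8))
```

When the first-pass category falls below the threshold, the topic-word expansion can overturn the unreliable decision.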
The main calculation formula of the text classification model is the Bayes rule

P(cj | x1, x2, ..., xn) = P(cj) · P(x1, x2, ..., xn | cj) / P(x1, x2, ..., xn)

where P(cj | x1, x2, ..., xn) is the probability that the text belongs to category cj when the feature words (x1, x2, ..., xn) occur together; P(cj) is the proportion of texts in the training set that belong to category cj; P(x1, x2, ..., xn | cj) is the probability that a text has the feature word set (x1, x2, ..., xn) given that it belongs to category cj; and P(x1, x2, ..., xn) is the probability of the feature word set taken over all given categories.
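A minimal naive Bayes sketch consistent with the formula above, under the usual conditional-independence assumption P(x1, ..., xn | c) = Π P(xi | c), with Laplace smoothing added here so unseen words do not zero out a category; the tiny training set is invented for illustration.

```python
# Toy naive Bayes: priors P(c), likelihoods P(w|c), posterior via Bayes rule.
from collections import Counter

def train(docs):
    """docs: list of (category, words). Returns priors and word counts."""
    priors, likelihoods = Counter(), {}
    for cat, words in docs:
        priors[cat] += 1
        likelihoods.setdefault(cat, Counter()).update(words)
    total = sum(priors.values())
    return {c: n / total for c, n in priors.items()}, likelihoods

def posterior(words, priors, likelihoods, vocab_size, smoothing=1.0):
    scores = {}
    for cat, prior in priors.items():
        total = sum(likelihoods[cat].values())
        p = prior
        for w in words:  # P(c) * product of smoothed P(w|c)
            p *= (likelihoods[cat][w] + smoothing) / (total + smoothing * vocab_size)
        scores[cat] = p
    z = sum(scores.values())             # normalize by P(x1, ..., xn)
    return {c: s / z for c, s in scores.items()}

docs = [("A", ["model", "training"]), ("A", ["model", "data"]),
        ("B", ["patient", "therapy"])]
priors, likes = train(docs)
post = posterior(["model"], priors, likes, vocab_size=5)
print(max(post, key=post.get))  # category "A" dominates for the word "model"
```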
The resource recommendation service system for multi-type academic resources of the present invention has the following features:
(1) The invention dynamically acquires academic resources of multiple types, such as academic papers, patents, academic conferences, and academic news, and efficiently obtains target academic resources through the topic crawler module.
(2) The invention performs topic classification of multiple types of academic resources according to their disciplinary attributes.
(3) Different user groups pay different degrees of attention to different types of academic resources. The invention implements a multi-strategy academic resource recommendation scheme based on user groups, recommending each type of academic resource in different proportions to users of different identities.
(4) Based on users' browsing habits, the invention delivers personalized recommendations of multiple types of academic resources according to users' individual behaviors.
By personalizing academic resource recommendations according to user identity, interests, and browsing behavior, the invention recommends academic resources to users more precisely, greatly improves researchers' working efficiency, creates a convenient and fast information-acquisition environment for scientific research, and effectively resolves the conflict between academic-resource information overload and users' resource acquisition.
In addition, the invention adopts LDA-based methods for academic resource acquisition and classification. Through the LDA topic model it deeply mines topic semantics, builds a sound guidance basis for the academic topic crawler, and integrates machine learning into the acquisition process, improving the quality and efficiency of academic resource acquisition. The academic resources gathered by the topic crawler are in turn used for LDA topic updates, so the topic model can be refreshed at any time to follow trends in academic development and supply researchers with cutting-edge resources in related fields. The proposed text classification method based on selective feature expansion suits complex application scenarios: it selectively adds topic information to data carrying little information while avoiding adding noise to data that is already informative. It thus offers one approach to optimizing text classification models, with strong scenario adaptability, highly usable results, and a classification model that is easy to update and maintain.
Description of the drawings
Fig. 1 is a schematic framework diagram of the overall academic resource recommendation service system of the invention;
Fig. 2 is a schematic diagram of the LDA model;
Fig. 3 is a schematic diagram of a text before preprocessing;
Fig. 4 is a schematic diagram of the same text after preprocessing;
Fig. 5 shows topics and topic documents after the training corpus has been trained with the LDA topic model;
Fig. 6 is a flow diagram of the LDA-based academic resource acquisition method of the invention;
Fig. 7 is a flow diagram of the LDA-based text classification method of the invention;
Fig. 8 shows the recall of three experiments on selected subjects;
Fig. 9 shows the precision of three experiments on selected subjects;
Fig. 10 is a schematic diagram of the recommendation process of the invention.
Detailed description
Specific embodiments of the present invention are described in detail below.
The academic resource recommendation service system of the invention, as shown in Fig. 1, comprises a web crawler, a text classification model, and an academic resource database. The web crawler crawls academic resources on the Internet; the text classification model classifies them into the predetermined A categories before they are stored in the local academic resource database, which exposes an open API for the display and resource recommendation modules to call. The system further comprises an academic resource model, a resource-quality-value calculation model, and a user interest model. A tracking software module is installed on the user's terminal to track and record the user's online browsing behavior. Based on the historical browsing behavior data of different user groups, the system computes how much attention users of each identity pay to each type of academic resource. Academic resources are modeled along four dimensions: resource type, subject distribution, keyword distribution, and LDA latent topic distribution. Combining the user's subjects of interest with historical browsing behavior data, the system models user preferences, computes the similarity between the academic resource model and the user interest model, combines it with the resource quality value to obtain a recommendation degree, and finally performs Top-N recommendation of academic resources for the user according to the recommendation degree. Following the subject categories in the Ministry of Education's Catalogue of Postgraduate Disciplines, all first-level disciplines are organized into 75 subject categories; that is, the number of categories A is 75.
1. Acquisition of academic resources
The web crawler of the invention is mainly a topic crawler and further comprises a corresponding LDA topic model. The LDA topic model is a three-layer "document-topic-word" Bayesian generative model, as shown in Fig. 2. The model is first trained on a training corpus with a configured number of topics K; before training, each training document is preprocessed by word segmentation and stop-word removal. Through the word-clustering effect of LDA training, the words of the training corpus are aggregated into K sets of topic-associated words, also called topic documents. When training the LDA topic model, the number of topics K may be set between 50 and 200, preferably to 100. Documents of various forms across disciplines can be crawled at random from the Internet; for long documents with standardized abstracts, such as papers, only the abstract need be taken, and an existing database may also be used as the training corpus. The corpus should be of considerable size, from at least tens of thousands up to several million documents. If the number of topics K is set to 100, LDA training aggregates all words of the training corpus into 100 topic-associated word sets, i.e., 100 topic documents. Each topic may be named manually according to the meanings of its word set, or left unnamed and distinguished only by a number or code; three such topic documents are shown in Fig. 5.
On top of an ordinary web crawler, the topic crawler further comprises a topic determination module, a similarity calculation module, and a URL priority-ranking module. The topic crawler consists of multiple distributed crawlers, one per academic topic, so that academic resources for multiple academic topics are gathered simultaneously. In each crawl, the topic determination module of the topic crawler determines the target topic and its topic document, and the topic document guides the topic-similarity computation. The similarity calculation module computes and judges the topic similarity of every anchor text on a crawled page in combination with the page content: hyperlinks whose combined similarity falls below the set threshold are discarded, and URLs whose combined similarity exceeds the threshold are selected. The topic crawler maintains a queue of unvisited URLs referenced by the hyperlinks of visited pages, sorted in descending order of similarity. Following this order, the crawler continually visits each URL, fetches the corresponding academic resources, and continuously classifies, labels, and stores them in the database for the topic document of the current crawl, until the unvisited URL queue is empty. The academic resources gathered in each crawl serve as new corpus for LDA topic-model training, and the crawl process is repeated so that the topic-associated words of each topic document are continually supplemented and updated, and the collected academic resources are supplemented and updated to a humanly acceptable level.
For ease of operation, the abstracts of academic resources may be used as the training corpus. Topics and topic documents are computed with the LDA topic model; the topic documents guide the topic-similarity computation during crawling, and the crawled content is then stored in the database as new corpus for LDA training, with the academic resource database exposing an open API for display. The specific steps are as follows:
Step 1: Download and preprocess abstracts of existing academic resources from multiple fields, manually divide them into categories by academic field, and use them as training corpora for the multiple LDA topics.
Step 2: Input the LDA topic model parameters K, α, and β, where K is the number of topics, α is the weight distribution of the topics prior to sampling, and β is the prior distribution of each topic over words. Training yields finer-grained topics and topic documents for the multiple topics; each topic document guides one crawler.
Step 3: Each crawler starts from selected high-quality seed URLs and maintains a crawl URL queue. By continually computing the similarity between the topic and both the page text and the text referenced by each anchor-text link, it reorders the crawl URL queue by similarity and fetches the page content most relevant to the topic.
Step 4: Academic resources obtained by the topic crawler are tagged with the corresponding topic label, stored in the database, and used as new corpus for training LDA, i.e., for topic document updates.
Step 5: Expose the open API of the academic resource database for display.
Step 1 comprises the following sub-steps:
(a) Corpus collection: download abstracts of existing academic resources from multiple fields as training corpus;
(b) Text preprocessing: extract the abstract, segment the Chinese text into words, and remove stop words;
(c) Classification into the corpus: manually divide the documents into categories by academic field and use them as training corpora for the multiple LDA topics.
Step 3 comprises the following sub-steps:
(a) Initial seed URLs: select good topic-specific seed sites;
(b) Page content extraction: download the page pointed to by the highest-priority URL and extract the required content and URL information from its HTML tags;
(c) Topic relevance analysis: decide whether to keep or discard the page; the invention mainly combines the existing VSM and SSRM techniques to compute topic relevance;
(d) Rank the unvisited page URLs by importance;
(e) Repeat (b) through (d) until the unvisited URL queue is empty.
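Sub-steps (b) through (e) amount to a best-first crawl loop, sketched below over an in-memory link graph. The `links` map and `relevance` scores are invented stand-ins for real page downloading and GVSM topic-relevance scoring; note that out-links of a page below the relevance threshold are not enqueued, matching the filtering behavior described herein.

```python
# Toy best-first crawl loop: fetch the best URL, keep it if its topic
# relevance clears the threshold, enqueue its out-links, repeat.
import heapq

def crawl(seed, links, relevance, threshold=0.5):
    queue, seen, kept = [(-relevance[seed], seed)], {seed}, []
    while queue:                                   # (e) until queue is empty
        _, url = heapq.heappop(queue)              # (b) highest priority first
        if relevance[url] >= threshold:            # (c) topic relevance filter
            kept.append(url)
            for nxt in links.get(url, []):         # (d) enqueue ranked out-links
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(queue, (-relevance[nxt], nxt))
    return kept

links = {"seed": ["a", "b"], "a": ["c"]}
relevance = {"seed": 0.9, "a": 0.8, "b": 0.2, "c": 0.6}
print(crawl("seed", links, relevance))  # low-relevance page "b" is discarded
```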
In sub-step (c), when the topic crawler analyzes the topic relevance of each electronic document it crawls, it uses the generalized vector space model (GVSM), which combines the VSM and SSRM topic-similarity algorithms, to compute the topic relevance of the crawled page and decide whether to keep it.
A topic is represented by a set of semantically related words together with weights indicating how strongly each word relates to the topic, i.e., topic Z = {(w1, p1), (w2, p2), ..., (wn, pn)}, where the i-th word wi is a word related to topic Z and pi measures that word's relevance to Z. In LDA this is expressed as Z = {(w1, p(w1|zj)), (w2, p(w2|zj)), ..., (wn, p(wn|zj))}, where wi ∈ W, p(wi|zj) is the probability of selecting word wi given topic zj, and zj is the j-th topic.
Generating a topic document is a probabilistic sampling process of the model, comprising the following sub-steps:
(a) For any document d in the corpus, draw the document length N ~ Poisson(ε);
(b) For any document d in the corpus, draw a topic distribution θ ~ Dirichlet(α);
(c) Generate the i-th word wi of document d: first, draw a topic zj ~ Multinomial(θ); then, for topic zj, draw a word distribution φ(zj) ~ Dirichlet(β); finally, emit the word with the highest probability under φ(zj). The LDA model is shown in Fig. 2.
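The generative steps (a)-(c) can be sketched with standard-library sampling: the Dirichlet draws are built from normalized Gamma draws, and the Poisson draw uses Knuth's method. The two-topic vocabulary and the hyperparameter values are invented for illustration.

```python
# Toy generative sampling for one document following steps (a)-(c).
import math
import random

def poisson(lam, rng):
    """Knuth's method for a Poisson(lam) draw."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def dirichlet(alpha, rng):
    """Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab_per_topic, alpha, beta, eps, rng):
    n = max(1, poisson(eps, rng))                             # (a) document length
    theta = dirichlet([alpha] * len(vocab_per_topic), rng)    # (b) topic mixture
    doc = []
    for _ in range(n):                                        # (c) per word:
        z = rng.choices(range(len(theta)), weights=theta)[0]  # z ~ Multinomial(theta)
        phi = dirichlet([beta] * len(vocab_per_topic[z]), rng)  # word distribution
        doc.append(vocab_per_topic[z][phi.index(max(phi))])   # most probable word
    return doc

rng = random.Random(0)
vocab = [["model", "data", "training"], ["patient", "therapy", "clinical"]]
doc = generate_document(vocab, alpha=0.5, beta=0.5, eps=6, rng=rng)
print(doc)  # every generated word comes from one of the topic vocabularies
```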
Here, the value of α represents the weight distribution of the topics prior to sampling, and the value of β represents the prior distribution of each topic over words.
All variables in the LDA model and the distributions they follow are as described above. By integrating out the latent variables, the whole model can in fact be reduced to the joint distribution P(w|Z), where w denotes the words, which are observable, and Z is the topic variable, the target product of the model; α and β are both initial parameters of the model. Integrating over the latent variables, where N is the vocabulary length and w a word, and integrating θ out of θ ~ Dirichlet(α), gives the topic-sampling formula

P(zi = j | z-i, w) ∝ (nj(wi) + β) / (nj + Nβ) · (nj(d) + α) / (n(d) + Kα)

where nj(w) denotes the number of times feature word w is assigned to topic j, nj denotes the number of feature words assigned to topic j, nj(d) denotes the number of feature words in text d assigned to topic j, and n(d) denotes the number of all feature words in text d that are assigned a topic.
As can be seen from the above, the three variables that mainly affect LDA modeling are α, β, and the number of topics K. To choose a good number of topics, the values of α and β are first fixed, and the change in the value of the integrated expression over the remaining variables is then computed.
When the LDA model is used for topic modeling of a text set, the number of topics K strongly affects how well the model fits the set, so the topic number must be set in advance. Here the optimal topic number is determined by measuring classification performance under different topic numbers, and is compared with the classification performance obtained when the perplexity value determines the best model fit. On the one hand this yields a more intuitive and accurate optimal topic number; on the other hand, the optimal topic number determined by perplexity reveals the gap between the corresponding classification performance and the actual results. The perplexity is

Perplexity = exp( - Σm log P(dm) / Σm Nm )

where M is the number of texts in the text set, Nm is the length of the m-th text, and P(dm) is the probability that the LDA model generates the m-th text:

P(dm) = Πn Σk P(wn | zk) · P(zk | dm)
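The perplexity computation can be sketched directly from the two formulas above. The toy topic-word and document-topic distributions are invented so the result is easy to verify by hand: one document of two words, each generated with probability 0.5, gives a perplexity of exactly 2.

```python
# Perplexity = exp(-sum_m log P(d_m) / sum_m N_m), with
# P(d_m) = prod_n sum_k P(w_n|z_k) * P(z_k|d_m).
import math

def perplexity(docs, topic_word, doc_topic):
    """docs: list of word lists; topic_word[k][w] = P(w|z_k);
    doc_topic[m][k] = P(z_k|d_m)."""
    log_prob, total_words = 0.0, 0
    for m, words in enumerate(docs):
        for w in words:
            p = sum(topic_word[k].get(w, 0.0) * doc_topic[m][k]
                    for k in range(len(topic_word)))
            log_prob += math.log(p)
        total_words += len(words)
    return math.exp(-log_prob / total_words)

topic_word = [{"model": 0.5, "data": 0.5}, {"patient": 1.0}]
doc_topic = [[1.0, 0.0]]
print(perplexity([["model", "data"]], topic_word, doc_topic))
```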
On the basis of a general-purpose crawler, the topic crawler of the invention adds three modules: a topic determination module, a similarity calculation module, and a URL priority-ranking module. These filter crawled pages and match them against the topic, so that the crawler ultimately obtains content highly relevant to the topic.
1. Topic determination module: before the topic crawler runs, its set of topic-related terms must be determined, i.e., the topic document must be established. The term set is usually determined in one of two ways: manually, or by extraction from an initial page set. Manual selection of keywords is subjective, while keywords extracted from initial pages suffer from high noise and low coverage. The number of topic terms gives the dimension of the topic vector, and the corresponding weights are the values of the vector's components. The topic term vector is written K = {k1, k2, ..., kn}, where n is the number of topic terms.
2. Similarity calculation module: to ensure the pages the crawler obtains stay as close to the topic as possible, pages must be filtered, discarding those whose topic relevance is below the set threshold so that their links are not processed in the next crawl step. If a page's topic relevance is very low, certain keywords probably appear on it only incidentally and its topic has little to do with the specified one, so processing its links is of little value; this is the fundamental difference between a topic crawler and an ordinary crawler. An ordinary crawler processes all links up to a set search depth and consequently returns many useless pages while further increasing the workload. Using the entire text for similarity comparison is clearly infeasible; the text must normally be distilled and extracted into a data structure suited to comparison and computation while reflecting its topic as faithfully as possible. The feature selection commonly used by topic crawlers is VSM, which also involves the TF-IDF algorithm. Here, semantic similarity based on HowNet is used: computing the similarity between the words of the document and the words of the topic document yields the similarity between the whole article and the topic.
3. URL priority-ranking module: this module screens out, from the unvisited URLs, potential pages with high topic similarity and sorts them by similarity; the higher the similarity, the higher the priority, so that high-similarity pages are visited first and the visited pages remain highly topic-relevant. When ranking unvisited URLs, the similarity of the page containing the URL and the similarity of the URL's anchor text (the text describing the URL) can be combined as ranking factors.
The invention uses HowNet's definition of the semantic information of each word to compute similarity between words. In HowNet, for two words W1 and W2, suppose W1 has n concepts S11, S12, ..., S1n and W2 has m concepts S21, S22, ..., S2m. The similarity of W1 and W2 is the maximum of the similarities between each concept of W1 and each concept of W2:

Sim(W1, W2) = max over i = 1..n, j = 1..m of Sim(S1i, S2j)
In this way, the similarity between two words can be converted into a similarity computation between concepts. All concepts in HowNet ultimately come down to representations in terms of sememes, so the computation of concept similarity can likewise be reduced to the similarity between the corresponding sememes. Suppose concept c1 has p sememes s11, s12, ..., s1p and concept c2 has q sememes s21, s22, ..., s2q. The similarity of c1 and c2 is the maximum of the similarities between each sememe of c1 and each sememe of c2:

Sim(c1, c2) = max over i = 1..p, j = 1..q of Sim(s1i, s2j)
All concepts in HowNet ultimately come down to representations in terms of sememes, so the computation of similarity between concepts can also be reduced to the computation of similarity between the corresponding sememes. Since all sememes form a tree-shaped sememe hierarchy according to hypernym-hyponym relations, the semantic distance between sememes within this hierarchy can be used to compute sememe similarity and, from it, concept similarity [27]. Let the path distance between two sememes s1 and s2 in the sememe hierarchy be Dis(s1, s2); the sememe similarity is then

Sim(s1, s2) = α / (Dis(s1, s2) + α)

where Dis(s1, s2) is the path length between s1 and s2 in the sememe hierarchy, based on the hypernym-hyponym relation, and is a positive integer, and α is an adjustable parameter.
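A sketch of the path-based sememe similarity above; both the tiny hypernym tree and the value of α (here 1.6, a commonly cited choice in the HowNet similarity literature) are illustrative assumptions, not values fixed by the invention.

```python
# Toy sememe similarity: sim = alpha / (path distance + alpha) over a
# child -> parent hypernym tree.

def path_distance(tree, a, b):
    """Path length between nodes a and b in a tree given as child -> parent."""
    def ancestors(x):
        chain = [x]
        while x in tree:
            x = tree[x]
            chain.append(x)
        return chain
    pa, pb = ancestors(a), ancestors(b)
    for i, node in enumerate(pa):          # first common ancestor
        if node in pb:
            return i + pb.index(node)
    raise ValueError("no common ancestor")

def sememe_similarity(tree, a, b, alpha=1.6):
    return alpha / (path_distance(tree, a, b) + alpha)

# entity -> animate -> {human, animal}
tree = {"animate": "entity", "human": "animate", "animal": "animate"}
print(sememe_similarity(tree, "human", "human"))   # distance 0 -> similarity 1.0
print(sememe_similarity(tree, "human", "animal"))  # distance 2
```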
The topic crawler of the invention is designed as a functional extension built on an ordinary crawler. The overall page-handling process comprises the steps: initial seed URL determination, page content extraction, topic relevance analysis, and URL ranking.
(a) Initial seed URLs: select good topic-specific seed sites so that the topic crawler can begin crawling smoothly.
(b) Page content extraction: download the page pointed to by the highest-priority URL and extract the required content and URL information from its HTML tags.
(c) Topic relevance analysis is the core module of the topic crawler; it decides whether a page is kept. The invention mainly uses the generalized vector space model (GVSM), which combines the existing VSM and SSRM techniques, to compute topic relevance.
For topic relevance analysis, TF-IDF is used to extract the text's keywords and compute their weights, on the basis of which the page's relevance is analyzed.
The TF-IDF quantities are computed as

tfi = fi / fmax,  idfi = log(N / Ni),  wdi = tfi · idfi

where wdi is the weight of word i in document d, tfi is the term frequency of word i, idfi is the inverse document frequency of word i, fi is the number of occurrences of word i in document d, fmax is the highest occurrence count among all words of document d, N is the total number of documents, and Ni is the number of documents containing word i. TF-IDF remains the most effective method for extracting keywords and computing word weights.
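A compact sketch following the definitions above (tf = f / fmax, idf = log(N / Ni), w = tf · idf); the two-document corpus is toy data.

```python
# Toy TF-IDF weights for one document against a small corpus.
import math
from collections import Counter

def tf_idf(doc, corpus):
    counts = Counter(doc)
    f_max = max(counts.values())
    n_docs = len(corpus)
    weights = {}
    for word, f in counts.items():
        tf = f / f_max
        n_i = sum(1 for d in corpus if word in d)   # documents containing the word
        idf = math.log(n_docs / n_i)
        weights[word] = tf * idf
    return weights

corpus = [["model", "training", "model"], ["patient", "therapy"]]
w = tf_idf(corpus[0], corpus)
print(w["model"])     # tf = 1.0, idf = log(2/1)
```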
The VSM topic relevance is computed as the cosine similarity

SimVSM(d, t) = Σi wdi · wti / ( sqrt(Σi wdi²) · sqrt(Σi wti²) )

where the numerator runs over the n common words appearing in both document d and topic t, wd is the word vector of document d, wt is the word vector of topic t, and wdi and wti are the TF-IDF values of word i in document d and topic t. This algorithm considers only the frequency vectors of the words shared by the texts when judging document similarity; it ignores the semantic relations between words, such as near-synonyms and synonyms, which limits the accuracy of the similarity.
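A sketch of the VSM cosine similarity over shared words, per the formula above; the weight vectors are toy TF-IDF values.

```python
# Cosine similarity of two sparse TF-IDF vectors represented as dicts.
import math

def vsm_similarity(wd, wt):
    """wd, wt: dicts mapping word -> TF-IDF weight."""
    common = set(wd) & set(wt)
    dot = sum(wd[i] * wt[i] for i in common)
    norm_d = math.sqrt(sum(v * v for v in wd.values()))
    norm_t = math.sqrt(sum(v * v for v in wt.values()))
    return dot / (norm_d * norm_t)

doc = {"model": 0.6, "training": 0.3}
print(vsm_similarity(doc, doc))                     # identical vectors -> 1.0
print(vsm_similarity(doc, {"patient": 0.5}))        # no shared words -> 0.0
```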
The SSRM topic relevance is computed as

SimSSRM(d, t) = Σi Σj wdi · wtj · Semij / Σi Σj wdi · wtj

where wdi and wtj are the TF-IDF values of word i in document d and word j in topic t, n and m are the numbers of words in document d and topic t respectively, and Semij is the semantic similarity between word i and word j.
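A sketch of the SSRM-style semantically weighted similarity, assuming the normalized form Σij wdi · wtj · Semij / Σij wdi · wtj given above; the word-similarity table is invented for illustration.

```python
# Toy SSRM: semantic-similarity-weighted average over all word pairs.

def ssrm_similarity(wd, wt, sem):
    """wd, wt: word -> TF-IDF weight; sem(a, b) -> similarity in [0, 1]."""
    num = sum(wi * wj * sem(a, b)
              for a, wi in wd.items() for b, wj in wt.items())
    den = sum(wi * wj for wi in wd.values() for wj in wt.values())
    return num / den

pairs = {("model", "model"): 1.0, ("model", "algorithm"): 0.8}
sem = lambda a, b: pairs.get((a, b), pairs.get((b, a), 0.0))

print(ssrm_similarity({"model": 1.0}, {"model": 1.0}, sem))      # 1.0
print(ssrm_similarity({"model": 1.0}, {"algorithm": 1.0}, sem))  # 0.8
```

Unlike VSM, distinct but semantically related words (here "model" and "algorithm") contribute to the score.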
Sem(C1, C2) = 2 · Depth(C3) / ( Path(C1, C3) + Path(C2, C3) + 2 · Depth(C3) )

where C1 and C2 are two concepts, corresponding to words w1 and w2; Sem(C1, C2) is the semantic similarity of concepts C1 and C2; C3 is the lowest common concept shared by C1 and C2; Path(C1, C3) is the number of nodes on the path from C1 to C3; Path(C2, C3) is the number of nodes on the path from C2 to C3; and Depth(C3) is the number of nodes on the path from C3 to the root node in the respective ontology. The SSRM algorithm considers only the semantic relation: if the words of two articles are all near-synonyms or synonyms, the document similarity computes to 1, i.e., complete identity, which is clearly inaccurate.
The present invention computes similarity by combining VSM and SSRM, a scheme also referred to as a generalized vector space model (GVSM), in which Sim(d_k, t), the topic similarity of document d_k, balances the cosine similarity of VSM against the semantic similarity of SSRM. By taking both document word frequencies and the semantic relations between words into account, the combined VSM/SSRM method effectively improves the precision of topic similarity calculation.
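As a sketch of how the combined score might be computed, the snippet below implements VSM cosine similarity, the SSRM weighted semantic average, and a weighted blend of the two. The blend weight `mu` and the toy `sem` function are illustrative assumptions, not taken from the patent text:

```python
import math

def vsm_sim(wd, wt):
    """Cosine similarity over shared words (VSM)."""
    shared = set(wd) & set(wt)
    num = sum(wd[i] * wt[i] for i in shared)
    den = math.sqrt(sum(v * v for v in wd.values())) * \
          math.sqrt(sum(v * v for v in wt.values()))
    return num / den if den else 0.0

def ssrm_sim(wd, wt, sem):
    """SSRM: weighted average of pairwise word semantic similarities."""
    num = sum(wd[i] * wt[j] * sem(i, j) for i in wd for j in wt)
    den = sum(wd[i] * wt[j] for i in wd for j in wt)
    return num / den if den else 0.0

def gvsm_sim(wd, wt, sem, mu=0.5):
    """Assumed combination: balance the cosine and semantic scores."""
    return mu * vsm_sim(wd, wt) + (1 - mu) * ssrm_sim(wd, wt, sem)

sem = lambda a, b: 1.0 if a == b else 0.3   # toy semantic similarity
d = {"lda": 0.8, "crawler": 0.5}
t = {"lda": 0.9, "topic": 0.4}
score = gvsm_sim(d, t, sem)
```

With a real system, `sem` would be backed by a HowNet-style ontology lookup rather than a constant for non-identical words.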
(d) Rank unvisited web page URLs by importance. URLs are ranked with the following formula:

priority(h) = (1/N) · Σ_{p: h ∈ p} [ λ · Sim(f_p, t) + (1 − λ) · Sim(a_h, t) ]

where priority(h) is the priority value of the unvisited hyperlink h, N is the number of retrieved web pages containing h, Sim(f_p, t) is the topic similarity of the full text of page p (which contains hyperlink h), Sim(a_h, t) is the topic similarity of the anchor text of hyperlink h, and λ is a weight balancing the full text against the anchor text. The similarity terms in the formula likewise use the combined VSM/SSRM method, which optimizes the priority ordering of the uncrawled URL queue and further improves the accuracy of topical academic resource acquisition.
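A minimal sketch of the frontier-ranking step, assuming the priority of a link averages the full-text similarities of the pages containing it and blends in the anchor-text similarity with weight λ; the URLs and scores below are illustrative:

```python
def priority(page_sims, anchor_sim, lam=0.5):
    """Priority of an unvisited hyperlink h (hedged reconstruction).

    page_sims: topic similarities Sim(f_p, t) of the N retrieved pages
    whose full text contains h; anchor_sim is Sim(a_h, t); lam balances
    full-text similarity against anchor-text similarity.
    """
    n = len(page_sims)
    if n == 0:
        return 0.0
    full_text = sum(page_sims) / n
    return lam * full_text + (1 - lam) * anchor_sim

# rank a frontier of candidate links by descending priority
frontier = {
    "http://a.example/paper": priority([0.8, 0.6], 0.9),
    "http://b.example/news":  priority([0.2], 0.1),
}
ranked = sorted(frontier, key=frontier.get, reverse=True)
```

The crawler would pop `ranked[0]` first, so pages whose context and anchor text both match the topic are visited before weakly related ones.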
The topic crawler of the present invention is a web information gathering tool built specifically to fetch resources on a given topic. Unlike a general-purpose crawler, a topic crawler aims to fetch pages related to a specific topic: it computes the relevance of each page to the topic to decide whether to fetch it, and maintains a queue of URLs to be crawled, visiting pages in order of URL priority so that highly relevant pages are visited first.
Current topic crawlers have some deficiencies. (1) Before a topic crawler starts, its set of topic words must be determined. This is usually done in one of two ways: manual specification, or extraction from analysis of the initial pages. Manual specification is inherently subjective, while keyword extraction from initial pages generally suffers from poor topic coverage. Both traditional methods introduce considerable bias when the crawler computes page-topic similarity. (2) The core of current text-heuristic topic crawlers is page similarity calculation, which judges whether the currently crawled page is close to the topic. Besides the accuracy of the topic determination module, the decisive factor is the similarity algorithm, usually VSM (the vector space model). Based on the assumption that different words are unrelated, VSM represents text as word vectors and computes inter-document similarity from shared word frequencies. This ignores the semantic relations between words and lowers the similarity scores of semantically highly related articles.
The topic crawler of the present invention is designed on top of a general-purpose crawler, adding three core modules: a topic determination module, a topic similarity calculation module, and a module for ranking the URLs to be crawled. To address the above deficiencies, the present invention proposes a topic crawler based on the LDA topic model and improves both the topic similarity algorithm and the URL prioritization algorithm, raising the content quality and accuracy of the crawler from the initial seed stage through the crawling process. Main contributions: (1) The LDA topic model deeply mines the semantic topic information of the corpus, providing a sound guidance basis for the topic crawler; integrating machine learning into the resource acquisition method improves the accuracy and quality of acquired resources. (2) In the topic similarity calculation module, a semantic similarity measure based on HowNet is adopted to balance cosine similarity and semantic similarity, achieving better topic matching.
2. Classification of academic resources
The present invention adopts an LDA-based text classification method, as shown in Figure 7. A Bayesian probability model is used as the text classification model. A set of feature words that best characterizes the text to be classified is extracted as input to the classification model; this original feature word set is the top portion of the text's word set after sorting by feature weight. The classification model computes the probability that the feature word combination belongs to each of A predetermined categories, and the category with the highest probability is taken as the text's category. Following the subject categories in the Ministry of Education's "Catalogue of Postgraduate Disciplines", all subjects are divided into 75 subject categories, i.e. the number of categories A is 75. The LDA topic model described above, together with the 100 topic documents obtained from its training, assists the text classification model in classification. In addition, a verification corpus with known categories is classified in advance by the text classification model over the A predetermined categories, yielding the model's classification accuracy for each of the A categories; this serves as the model's classification credibility index per category. The accuracy for a category is the proportion of correctly classified texts among all verification texts assigned to that category by the model, and a classification accuracy threshold is preset; a threshold of 80% is suitable for classification verification. Classifying each text with the text classification model comprises the following steps:
Step 1. For each text to be classified, compute the feature weights of all its preprocessed words. A word's feature weight is proportional to its number of occurrences in the text and inversely proportional to its number of occurrences in the training corpus. Sort the resulting word set in descending order of feature weight, and take the top portion of each text's original word set as its feature word set.
Step 2. Using the text classification model, take each text's original feature word set and compute the probability that the text belongs to each of the A predetermined categories; select the category with the highest probability as the text's classification.
Step 3. Examine the classification result of Step 2. If the model's classification accuracy for the resulting category reaches the preset threshold, output the result directly; if it does not, proceed to Step 4.
Step 4. Feed the preprocessed text into the LDA topic model, compute the text's weight for each of the K configured topics, and select the topic with the largest weight. Add the top Y topic-related words of that topic, obtained in advance from LDA topic model training, to the text's original feature word set to form the expanded feature word set. Using the text classification model again, compute the probability that the text belongs to each of the A categories, and select the category with the highest probability as the text's final classification. In practice 10 to 20 words may be taken, e.g. the top 15 topic-related words are added to the text's original feature word set as the expanded feature word set; it does not matter if a newly added word duplicates an original feature word.
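Steps 2–4 above can be sketched as a small control-flow function. The `classifier` and `lda_top_words` callables are assumed interfaces standing in for the Bayesian model and the trained LDA model; only the selective-expansion logic itself comes from the text:

```python
def classify_with_selective_expansion(text_features, classifier, lda_top_words,
                                      category_accuracy, threshold=0.8, top_y=15):
    """Reclassify with LDA topic words only when the first prediction
    lands in a low-confidence category.

    classifier: maps a feature-word list to (category, probability);
    lda_top_words: returns the top-Y related words of the text's
    dominant topic. Both are assumed interfaces, not patent text.
    """
    category, _ = classifier(text_features)
    # Step 3: trust the result if the category's accuracy is high enough.
    if category_accuracy.get(category, 0.0) >= threshold:
        return category
    # Step 4: expand the feature set with topic-related words and retry.
    expanded = text_features + lda_top_words(text_features, top_y)
    category, _ = classifier(expanded)
    return category
```

The point of the gate is that well-recognized categories are left untouched, while weakly supported texts get extra topical evidence before the final decision.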
The main calculation formula of the text classification model is

P(c_j | x_1, x_2, ..., x_n) = P(c_j) · P(x_1, x_2, ..., x_n | c_j) / P(x_1, x_2, ..., x_n)   (6)

where P(c_j | x_1, x_2, ..., x_n) is the probability that the text belongs to category c_j when the feature words (x_1, x_2, ..., x_n) appear together; P(c_j) is the proportion of texts in the training set that belong to category c_j; P(x_1, x_2, ..., x_n | c_j) is the probability that a text's feature word set is (x_1, x_2, ..., x_n) given that the text belongs to category c_j; and the denominator P(x_1, x_2, ..., x_n) is the joint probability of the feature words over all categories.
Clearly, for any given set of categories the denominator P(x_1, x_2, ..., x_n) is a constant, and the model's classification result is the category with the highest probability in formula (6). Maximizing (6) therefore reduces to maximizing

c* = argmax_j P(c_j) · P(x_1, x_2, ..., x_n | c_j)   (7)
Further, by the naive Bayes assumption the feature attributes x_1, x_2, ..., x_n are conditionally independent given the category, so their joint conditional probability equals the product of the individual conditional probabilities:

P(x_1, x_2, ..., x_n | c_j) = Π_i P(x_i | c_j)   (8)
So formula (7) becomes

c* = argmax_j P(c_j) · Π_i P(x_i | c_j)   (9)

which is the classification function used for classification.
The probability values P(c_j) and P(x_i | c_j) in the classification function are still unknown; therefore, to evaluate the maximum of the classification function, the prior probabilities in (9) are estimated as follows:

P(c_j) = N(C = c_j) / N

where N(C = c_j) is the number of training samples belonging to category c_j and N is the total number of training samples.

P(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) / ( N(C = c_j) + M )

where N(X_i = x_i, C = c_j) is the number of training samples in category c_j containing attribute x_i, N(C = c_j) is the number of training samples in category c_j, and M is the number of keywords in the training set after removal of stop words; the add-one (Laplace) correction avoids zero probabilities for unseen words.
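A compact naive Bayes classifier implementing formulas (6)–(9) with the estimates above; Laplace smoothing over the M-word vocabulary is the assumed smoothing scheme, and the tiny training set is illustrative:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes: P(c_j) = N_cj / N, and P(x_i|c_j)
    estimated with add-one smoothing over an M-word vocabulary."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # N(C = c_j)
        self.word_counts = defaultdict(Counter)      # N(X_i = x_i, C = c_j)
        vocab = set()
        for words, c in zip(docs, labels):
            self.word_counts[c].update(words)
            vocab.update(words)
        self.m = len(vocab)                          # M keywords
        self.n = len(docs)                           # N training samples
        return self

    def predict(self, words):
        best, best_lp = None, float("-inf")
        for c, n_c in self.class_counts.items():
            lp = math.log(n_c / self.n)              # log P(c_j)
            total = sum(self.word_counts[c].values())
            for w in words:                          # log Π P(x_i | c_j)
                lp += math.log((self.word_counts[c][w] + 1) / (total + self.m))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

docs = [["rock", "mineral"], ["rock", "strata"], ["dynasty", "empire"]]
labels = ["geology", "geology", "history"]
nb = NaiveBayesText().fit(docs, labels)
```

Working in log space avoids underflow when the feature word set (x_1, ..., x_n) is long.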
LDA is a statistical topic model for discrete data sets proposed by Blei et al. in 2003; it is a three-layer "document-topic-word" Bayesian generative model. The original model placed a Dirichlet prior (via a hyperparameter) only on the document-topic distribution; Griffiths et al. subsequently placed a Dirichlet prior on the topic-word distribution as well. The LDA model is shown in Figure 2, where N is the number of words in a document, M is the number of documents in the collection, K is the number of topics, φ is the topic-word distribution, θ is the document-topic distribution, Z is the latent variable representing the topic, W is a word, α is the hyperparameter of θ, and β is the hyperparameter of φ.
The LDA topic model treats a document as a set of words with no ordering among them. A document may contain multiple topics, each word in a document is generated by some topic, and the same word may belong to different topics; LDA is therefore a typical bag-of-words model.
The key to training an LDA model is inferring the distributions of the latent variables, i.e. obtaining the latent text-topic distribution θ and topic-word distribution φ of the target text. Given the model parameters α and β, the joint distribution of the random variables θ, z and w for a text d is

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1..N} p(z_n | θ) · p(w_n | z_n, β)
Because several latent variables appear simultaneously in the formula above, computing θ and φ directly is infeasible, so the parameters must be estimated by inference. Common parameter estimation algorithms include expectation maximization (EM), variational Bayesian inference, and Gibbs sampling. Here Gibbs sampling is used to infer the model parameters: Griffiths showed that Gibbs sampling outperforms variational Bayesian inference and EM in both perplexity and training speed. The EM algorithm tends to find local optima because of local maximization of its likelihood function, and the model obtained by variational Bayesian inference deviates from the true distribution, whereas Gibbs sampling can quickly and effectively extract topic information from large-scale data sets and has become the most popular estimation algorithm for LDA models.
MCMC is a family of approximate iterative methods for drawing samples from complex probability distributions. Gibbs sampling, a simple instance of MCMC, constructs a Markov chain that converges to a target distribution and draws samples close to that distribution from the chain. During training, the algorithm samples only the topic variable z_i, with conditional probability

P(z_i = k | z_{-i}, w) ∝ ( (n_{k,-i}^(w_i) + β) / (n_{k,-i} + Wβ) ) · ( (n_{d,-i}^(k) + α) / (n_{d,-i} + Kα) )

where W is the vocabulary size. The left-hand side is the probability that the current word w_i belongs to topic k given the topic assignments of all the other words. On the right-hand side, the counts exclude the current assignment of w_i: the first factor is the probability of the word w_i under topic k, and the second factor is the probability of topic k in the current document.
The specific steps of Gibbs sampling are:
1) Initialization: randomly assign a topic to each word w_i, where z_i, the topic of word w_i, is initialized to a random integer between 1 and K, for i from 1 to N, N being the number of feature word tokens in the text collection; this is the initial state of the Markov chain.
2) For i from 1 to N, compute the probability that the current word w_i belongs to each topic according to formula (2), resample a topic for w_i from these probabilities, and obtain the next state of the Markov chain.
After iterating step 2) a sufficient number of times, the Markov chain is considered to have reached its stationary state, at which point every word in the document has a specific topic assignment. For each document, the text-topic distribution θ and the topic-word distribution φ can then be estimated as

φ_{k,w} = (n_k^(w) + β) / (n_k + Wβ),  θ_{d,k} = (n_d^(k) + α) / (n_d + Kα)

where n_k^(w) is the number of times feature word w is assigned to topic k, n_k is the total number of feature words assigned to topic k, n_d^(k) is the number of feature words in text d assigned to topic k, and n_d is the total number of topic-assigned feature words in text d.
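A minimal collapsed Gibbs sampler following steps 1)–2) and the θ/φ estimates above; documents are lists of integer word ids 0..V−1, and the hyperparameter values are illustrative:

```python
import random

def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch).

    docs: list of documents, each a list of word ids 0..V-1.
    Returns (theta, phi) estimated from the final topic assignments.
    """
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})
    n_kw = [[0] * v for _ in range(k)]       # n_k^(w): word w assigned to topic k
    n_k = [0] * k                            # n_k: words assigned to topic k
    n_dk = [[0] * k for _ in docs]           # n_d^(k): doc d's words in topic k
    z = []                                   # step 1: random initial topics
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            n_kw[t][w] += 1; n_k[t] += 1; n_dk[d][t] += 1
        z.append(zd)
    for _ in range(iters):                   # step 2: resample each z_i
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                  # remove the current assignment
                n_kw[t][w] -= 1; n_k[t] -= 1; n_dk[d][t] -= 1
                probs = [(n_kw[j][w] + beta) / (n_k[j] + v * beta) *
                         (n_dk[d][j] + alpha) for j in range(k)]
                t = rng.choices(range(k), weights=probs)[0]
                z[d][i] = t
                n_kw[t][w] += 1; n_k[t] += 1; n_dk[d][t] += 1
    # estimate theta and phi from the final counts
    theta = [[(n_dk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[j][w] + beta) / (n_k[j] + v * beta) for w in range(v)]
           for j in range(k)]
    return theta, phi
```

The per-topic weights in `probs` drop the document-side denominator, which is constant across topics and cancels in the normalization done by `choices`.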
The classification accuracy used as the credibility index of the text classification model is computed as a probability:

Accuracy_i = N_i / M_i

where i denotes a category, N_i is the number of times the classifier correctly predicts category i, and M_i is the total number of times the classifier predicts category i.
Precision P, recall R, and their combined measure F1 may be used as the final evaluation metrics. Precision P measures the proportion of test samples correctly assigned to a category among all test samples assigned to that category; recall R measures the proportion of test samples correctly assigned to a category among all test samples that truly belong to it. Taking a category C_i as an example, let n_++ be the number of samples correctly judged to belong to C_i, n_+- the number of samples that do not belong to C_i but are judged to, and n_-+ the number of samples that belong to C_i but are judged not to. For category C_i, recall R, precision P and the combined F1 score are

P = n_++ / (n_++ + n_+-),  R = n_++ / (n_++ + n_-+),  F1 = 2PR / (P + R)
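The three metrics follow directly from the per-category counts n_++, n_+- and n_-+; the counts in the usage line are illustrative:

```python
def prf1(n_pp, n_pm, n_mp):
    """Precision, recall and F1 for one category C_i.

    n_pp: correctly assigned to C_i; n_pm: wrongly assigned to C_i;
    n_mp: belonging to C_i but assigned elsewhere.
    """
    p = n_pp / (n_pp + n_pm) if n_pp + n_pm else 0.0
    r = n_pp / (n_pp + n_mp) if n_pp + n_mp else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = prf1(80, 20, 10)
```

The guards return 0.0 for empty denominators, e.g. a category the classifier never predicts.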
The inventors conducted three sets of experiments: Experiment 1, classifier performance on the original feature set; Experiment 2, classifier performance on the expanded feature set; Experiment 3, classifier performance on the selectively expanded feature set, with the credibility threshold set to 0.8. Table 2 shows the recall and precision of the three experiments on a subset of subjects:
Table 2. Recall and precision for selected subjects
As Table 2 shows, with the original feature set, History has high recall but low precision, indicating that much data not belonging to History was assigned to it by the classifier. Meanwhile, History of Science and Technology has low recall, indicating that much data belonging to this subject was assigned elsewhere; since the two subjects are thematically very similar, the classifier most likely assigned much History of Science and Technology data to History. A similar situation arises between Geological Resources and Geological Engineering and Geology. The expanded feature set alleviates these problems but harms subjects that previously had high recognition rates. Selective feature expansion, on the other hand, avoids affecting well-recognized subjects while improving, to a degree, subjects whose recognition was low due to insufficient information.
From the above experimental results, the average recall, average precision and average F1 of the three experiments can be computed. The results are as follows:
Table 3. Experimental comparison
As Table 3 shows, in the face of complex classification scenarios, the selective feature expansion method of the present invention adapts better than methods based on the original feature set or the fully expanded feature set: its average recall, average precision and average F1 are clearly higher than those of the other schemes, achieving better practical results.
Figure 6 shows the recall of the three experiments on selected subjects; Figure 7 shows their precision on selected subjects.
With the advent of the big data era, resource classification faces ever greater challenges; different application scenarios call for different classification techniques, and no single technique suits all classification tasks. The selective feature expansion method proposed here suits complex application scenarios: it selectively adds topic information to data carrying little information while avoiding adding noise to data with sufficient information, and the method has broad applicability.
3. Recommendation of academic resources
The process by which the present invention recommends academic resources to a user comprises a cold-start recommendation stage and a secondary recommendation stage. In the cold-start stage, the user is recommended high-quality resources matching his or her subjects of interest; a high-quality resource is one whose resource quality value, computed by the resource quality model and then compared, is high, where the quality value is the arithmetic or weighted mean of resource authority, resource community popularity, and resource recentness. In the secondary recommendation stage, a user interest model and a resource model are built, the similarity between the two is computed, the recommendation score is then computed together with the resource quality value, and finally Top-N academic resource recommendations are made to the user according to the recommendation score.
1. Recommendation algorithm in the cold-start stage:
Table 4. Attributes and measures of the five resource categories
High-quality academic resources attract and retain new users. In the cold-start stage, the system recommends to users high-quality resources matching their subjects of interest. High-quality resources are academic resources with high quality values, measured chiefly by attributes such as authority, community popularity, and recentness. The attributes and measures of the five resource categories are shown in Table 4.
The authority of a paper, Authority, is computed from the publication level score Level and the normalized citation count Cite (formula (1)). Level is the quantified score of the level of the publication venue. Venues are divided into 5 levels scored 1, 0.8, 0.6, 0.4 and 0.2 respectively: top journals or conferences such as Nature and Science score 1, second-level venues such as ACM Transactions score 0.8, and the lowest level scores 0.2. Cite is computed as follows:
Cite = Cites / maxCite. (2)
Cite is the quantified citation score of the paper, Cites is the paper's citation count, and maxCite is the largest citation count in the paper's source database.
The authority of the other four resource categories is computed analogously to papers, differing only in the quantification method.
The community popularity of a paper, Popularity, is computed as

Popularity = readTimes / maxReadTimes. (3)

readTimes is the number of times the paper has been read, and maxReadTimes is the largest read count in the paper's source database.
Recentness is computed in the same way for all resources:

Recentness = ((year − minYear) · 12 + (month − minMonth)) / ((maxYear − minYear) · 12 + (maxMonth − minMonth)) (4)

where year and month are the resource's year and month of publication, and minYear, minMonth, maxYear and maxMonth are the earliest and latest publication years and months among all resources in that category's source database.
The quality value of a paper, Quality, is computed as the arithmetic mean of the three attributes above:

Quality = (Authority + Popularity + Recentness) / 3. (5)
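A sketch of the cold-start quality computation. The equal-weight average of Level and Cite inside Authority and the month-granularity form of Recentness are assumptions consistent with the attribute definitions above:

```python
def recentness(year, month, min_year, min_month, max_year, max_month):
    """Normalised publication date within the source database
    (hedged reconstruction of formula (4))."""
    span = (max_year - min_year) * 12 + (max_month - min_month)
    pos = (year - min_year) * 12 + (month - min_month)
    return pos / span if span else 1.0

def paper_quality(level, cites, max_cite, read_times, max_read_times,
                  year, month, min_ym, max_ym):
    """Quality as the arithmetic mean of authority, popularity and
    recentness; the equal weighting of Level and Cite is an assumption."""
    authority = (level + cites / max_cite) / 2
    popularity = read_times / max_read_times
    recent = recentness(year, month, min_ym[0], min_ym[1], max_ym[0], max_ym[1])
    return (authority + popularity + recent) / 3

q = paper_quality(1.0, 50, 100, 200, 400, 2016, 6, (2010, 1), (2016, 12))
```

Each component lies in [0, 1], so Quality is directly comparable across papers in the same source database.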
2. Algorithm of the secondary recommendation stage:
This stage uses a recommendation method that fuses user behavior and resource content: a user interest model and a resource model are built separately, their similarity is computed, the recommendation score is then computed together with the resource quality value, and recommendations are made according to the score.
The academic resource model is represented as

M_r = {T_r, K_r, C_t, L_r} (6)
where T_r is the subject distribution vector of the academic resource, i.e. the probability values of the resource's distribution over the 75 subjects, obtained from the Bayesian multinomial model.
K_r = {(k_r1, ω_r1), (k_r2, ω_r2), ..., (k_rm, ω_rm)}, where m is the number of keywords, k_ri (1 ≤ i ≤ m) is the i-th keyword of a single academic resource, and ω_ri is the weight of keyword k_ri, obtained with an improved tf-idf algorithm:

w(i, r) = tf(i, r) · log(Z / L) (7)

where w(i, r) is the weight of the i-th keyword in document r, tf(i, r) is the frequency of the i-th keyword in document r, Z is the total number of documents in the collection, and L is the number of documents containing keyword i.
L_r is the LDA latent topic distribution vector, L_r = {l_r1, l_r2, l_r3, ..., l_rN1}, where N1 is the number of latent topics.
C_t is the resource type; t takes values 1 to 5, corresponding to the five categories of academic resources: academic papers, academic patents, academic news, academic conferences, and academic books.
Based on how users behave in mobile software, a user's operations on an academic resource are divided into opening, reading, star rating, sharing, and favoriting. Star rating is an explicit behavior; the others are implicit. Explicit behavior directly reflects the degree of user preference (with star ratings, a higher score means the user likes the resource more); implicit behavior does not directly reflect preference, but it often carries more, and more valuable, information than explicit feedback.
The user interest model is based mainly on the user's background and the academic resources he or she has browsed. From the user's browsing behaviors, combined with the academic resource model, a user interest model can be built; this model adjusts dynamically as the user's interests change. The user interest model is represented as

M_u = {T_u, K_u, C_t, L_u} (8)
where T_u is the user's subject preference distribution vector, formed from the subject distribution vectors T_r of the academic resources of a given type that the user has browsed over a period of time, weighted by the user's behavior:

T_u = (1 / sum) · Σ_{j=1..sum} s_j · T_jr (9)

where sum is the total number of academic resources the user has acted upon, s_j is the "behavior coefficient" of the user's actions on academic resource j (the larger the value, the more the user likes the resource), and T_jr is the subject distribution vector of the j-th resource. The computation of s_j takes opening, reading, rating, favoriting and sharing into account and accurately reflects the user's degree of preference for the resource.
K_u = {(k_u1, ω_u1), (k_u2, ω_u2), ..., (k_uN2, ω_uN2)} is the user's keyword-preference distribution vector, where N2 is the number of keywords, k_ui (1 ≤ i ≤ N2) is the i-th preferred keyword of the user, and ω_ui is the weight of keyword k_ui. It is computed from the keyword distribution vectors K_r of all academic resources on which user u has acted over a period of time.
K_jr′ = s_j · K_jr    (10)
Equation (10) yields a new keyword distribution vector for each academic resource; the TOP-N2 entries across the new keyword distribution vectors of all resources are then selected as the user's keyword-preference distribution vector K_u.
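As a sketch of Eq. (10) and the TOP-N2 selection, each resource's keyword weights can be scaled by its behavior coefficient and the highest-weighted keywords retained; merging duplicate keywords by taking the maximum scaled weight is an assumption here, as are the names:

```python
def user_keyword_preference(resources, n2):
    """resources: list of (s_j, {keyword: weight}) pairs.
    Returns the TOP-n2 (keyword, weight) pairs as K_u (a sketch; the
    merge rule for keywords shared by several resources is assumed)."""
    scores = {}
    for s_j, k_jr in resources:
        for kw, w in k_jr.items():
            scores[kw] = max(scores.get(kw, 0.0), s_j * w)  # K_jr' = s_j * K_jr
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n2]

print(user_keyword_preference([(2.0, {"a": 0.5, "b": 0.2}),
                               (1.0, {"b": 0.9})], 1))  # [('a', 1.0)]
```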
L_u is the user's LDA latent-topic preference distribution vector, computed from the resources' LDA latent-topic distribution vectors L_r = {l_r1, l_r2, l_r3, ..., l_rN1} in the same way as T_u.
Computing the behavior coefficient: s denotes the behavior coefficient, T is a reading-time threshold, and δ is an adjustment parameter. The threshold T is introduced to guard against accidental clicks and is therefore small. If the user reads resource j for less than T, the click is treated as accidental and s = 0. If the user is willing to read for at least T, then: if the user rates the resource and the rating exceeds the mean of all of the user's previous ratings, the user is assumed to like j and s is increased by δ; if the user collects or shares j, the user is assumed to like j strongly and s is again increased by δ. The invention takes reading, rating, collecting, and sharing to reflect the user's interest preference from shallow to deep. The value of s is determined mainly by its initial value and δ; to map all user behaviors onto a value between 0 and 2, the initial value is set to 1 and δ = 0.333333.
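The behavior-coefficient rule above can be sketched as follows; the initial value 1 and δ = 1/3 come from the text, while the function signature and the treatment of collecting and sharing as separate δ increments are assumptions:

```python
def behavior_coefficient(read_time, t_threshold, rating=None, mean_rating=None,
                         collected=False, shared=False, delta=1/3):
    """Map a user's actions on a resource to a coefficient s in [0, 2]."""
    if read_time < t_threshold:      # below threshold T: treated as a mis-click
        return 0.0
    s = 1.0                          # initial value: the user actually read it
    if rating is not None and mean_rating is not None and rating > mean_rating:
        s += delta                   # rated above the user's historical mean
    if collected:
        s += delta                   # collecting signals strong interest
    if shared:
        s += delta                   # sharing signals strong interest
    return s

print(behavior_coefficient(1, 5))    # 0.0 (mis-click)
print(round(behavior_coefficient(10, 5, rating=5, mean_rating=3,
                                 collected=True, shared=True), 6))  # 2.0
```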
Computing the similarity between the academic resource model and the user interest model:
The academic resource model is expressed as:
M_r = {T_r, K_r, C_t, L_r}    (12)
The user interest model is expressed as:
M_u = {T_u, K_u, C_t, L_u}    (13)
The similarity between the user subject-preference distribution vector T_u and the resource subject distribution vector T_r is computed as the cosine similarity, i.e.:

Sim(T_u, T_r) = (T_u · T_r) / (‖T_u‖ · ‖T_r‖)    (14)
The similarity between the user's LDA latent-topic distribution vector L_u and the resource's LDA latent-topic distribution vector L_r is likewise computed as the cosine similarity, i.e.:

Sim(L_u, L_r) = (L_u · L_r) / (‖L_u‖ · ‖L_r‖)    (15)
The similarity between the user keyword-preference distribution vector K_u and the resource keyword distribution vector K_r is computed with the Jaccard similarity:

Sim(K_u, K_r) = |K_u ∩ K_r| / |K_u ∪ K_r|    (16)
The similarity between the user interest model and the academic resource model is then:

Sim(M_u, M_r) = σ·Sim(T_u, T_r) + ρ·Sim(L_u, L_r) + τ·Sim(K_u, K_r)    (17)
where σ + ρ + τ = 1; the specific weight assignment is obtained by experimental training.
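A minimal sketch of the similarity computation above, assuming placeholder weights σ = 0.4, ρ = 0.3, τ = 0.3 (the patent obtains the actual weights by experimental training):

```python
import math

def cosine(a, b):
    """Cosine similarity of two distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(ku, kr):
    """Jaccard similarity of two keyword sets."""
    ku, kr = set(ku), set(kr)
    return len(ku & kr) / len(ku | kr) if ku | kr else 0.0

def model_similarity(tu, tr, lu, lr, ku, kr, sigma=0.4, rho=0.3, tau=0.3):
    """Sim(M_u, M_r) = sigma*Sim(T_u,T_r) + rho*Sim(L_u,L_r) + tau*Sim(K_u,K_r)."""
    return sigma * cosine(tu, tr) + rho * cosine(lu, lr) + tau * jaccard(ku, kr)

sim = model_similarity([1, 0], [1, 0], [0, 1], [0, 1], ["a"], ["a"])
print(round(sim, 6))  # 1.0 when the user and resource models coincide
```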
To recommend high-quality resources that match the user's interests, the concept of recommendation degree (Recommendation_degree) is introduced: the larger a resource's recommendation degree, the better the resource matches the user's interest preferences and the higher its quality. It is computed as:
Recommendation_degree = λ1·Sim(M_u, M_r) + λ2·Quality,  where λ1 + λ2 = 1    (18)
The secondary recommendation stage then performs Top-N recommendation according to the recommendation degrees of the academic resources.
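A sketch of Eq. (18) and the Top-N step; λ1 = 0.7 and λ2 = 0.3 are placeholders, since the patent only requires λ1 + λ2 = 1:

```python
def recommendation_degree(similarity, quality, lam1=0.7, lam2=0.3):
    """Recommendation_degree = lam1 * Sim(M_u, M_r) + lam2 * Quality."""
    return lam1 * similarity + lam2 * quality

def top_n(resources, n):
    """resources: list of (resource_id, similarity, quality) triples.
    Returns the ids of the n resources with the highest recommendation degree."""
    scored = [(rid, recommendation_degree(s, q)) for rid, s, q in resources]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [rid for rid, _ in scored[:n]]

# "a" matches the user well but has middling quality; "b" is the reverse:
print(top_n([("a", 0.9, 0.5), ("b", 0.2, 0.9)], 1))  # ['a']
```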
The entire recommendation process is shown in Figure 10. As can be seen from Figure 2, the system's overall recommendation flow comprises three parts: resource model construction, recommendation in the cold-start stage, and the secondary recommendation process. Their specific steps are as follows:
Resource model construction:
1) Acquire the five categories of academic resource data via web crawlers and data-interface technology;
2) Parse each academic resource, extract its relevant information, and insert it into the resource library;
3) Preprocess every record in the resource library, including word segmentation and stop-word removal;
4) Compute each resource's subject distribution, keyword distribution, and LDA latent-topic distribution using three pre-trained models: a Bayesian multinomial model, VSM, and an LDA model;
5) Derive each resource's subject categories from its subject distribution vector, taking the three subjects with the highest probabilities;
6) Compute each resource's quality value;
7) Insert the subject distribution vector, keyword distribution vector, LDA latent-topic distribution vector, subject categories, and quality value into the resource library.
Recommendation in the cold-start stage:
1) Select academic resources that match the user's subjects of interest;
2) Recommend high-quality resources according to the resources' quality values.
Recommendation in the secondary recommendation stage:
1) Retrieve the user's browsing records and compute the "behavior coefficients";
2) Build the user interest model;
3) Compute the similarity between the resource model and the user interest model;
4) Compute the recommendation degree from the similarity and the quality value;
5) Perform Top-N recommendation according to the resources' recommendation degrees.
To facilitate subsequent computation, the resource model is built in advance. When a user first uses the system, the cold-start recommendation strategy is used to recommend academic resources; once the user's behavior data reaches a certain volume, the secondary recommendation strategy is used instead.
The invention proposes recommendation strategies that correspond to the continuous accumulation and change of academic resources and user data. The cold-start stage recommends high-quality resources matching the user's subjects of interest; the secondary recommendation stage models each category of academic resource along four dimensions (resource type, subject distribution, keyword distribution, and LDA latent-topic distribution), models the user's interest preferences from user behavior, and finally performs Top-N recommendation according to resource recommendation degree.
Experimental results show that the academic resource recommendation strategy of the invention closely matches users' subjects of interest and achieves a clear improvement in resource CTR. In the secondary recommendation stage, the results show that the recommendation strategy under the proposed modeling method achieves markedly higher Precision than the strategies under the two currently common resource-modeling approaches.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201611130297.9A CN106815297B (en) | 2016-12-09 | 2016-12-09 | Academic resource recommendation service system and method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201611130297.9A CN106815297B (en) | 2016-12-09 | 2016-12-09 | Academic resource recommendation service system and method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106815297A true CN106815297A (en) | 2017-06-09 |
| CN106815297B CN106815297B (en) | 2020-04-10 |
Family
ID=59107077
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201611130297.9A Active CN106815297B (en) | 2016-12-09 | 2016-12-09 | Academic resource recommendation service system and method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106815297B (en) |
Cited By (72)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107247751A (en) * | 2017-05-26 | 2017-10-13 | 武汉大学 | Content recommendation method based on LDA topic models |
| CN107590232A (en) * | 2017-09-07 | 2018-01-16 | 北京师范大学 | A kind of resource recommendation system and method based on Network Study Environment |
| CN107818145A (en) * | 2017-10-18 | 2018-03-20 | 南京邮数通信息科技有限公司 | A kind of user behavior tag along sort extracting method based on dynamic reptile |
| CN107833061A (en) * | 2017-11-17 | 2018-03-23 | 中农网购(江苏)电子商务有限公司 | One kind is for retail Intelligent agricultural product allocator |
| CN107908669A (en) * | 2017-10-17 | 2018-04-13 | 广东广业开元科技有限公司 | A kind of big data news based on parallel LDA recommends method, system and device |
| CN108038765A (en) * | 2017-12-23 | 2018-05-15 | 临泉县烜槿餐饮管理有限公司 | A kind of Catering Management formula order dishes system caught based on video |
| CN108090131A (en) * | 2017-11-23 | 2018-05-29 | 北京洪泰同创信息技术有限公司 | It teaches the method for pushing of auxiliary resource data and teaches the pusher of auxiliary resource data |
| CN108255992A (en) * | 2017-12-29 | 2018-07-06 | 广州贝睿信息科技有限公司 | It is a kind of paint originally can be readability assessment recommend method |
| CN108280114A (en) * | 2017-07-28 | 2018-07-13 | 淮阴工学院 | A kind of user's literature reading interest analysis method based on deep learning |
| CN108446273A (en) * | 2018-03-15 | 2018-08-24 | 哈工大机器人(合肥)国际创新研究院 | Kalman filtering term vector learning method based on Di's formula process |
| CN108595593A (en) * | 2018-04-19 | 2018-09-28 | 南京大学 | Meeting research hotspot based on topic model and development trend information analysis method |
| CN108600306A (en) * | 2018-03-20 | 2018-09-28 | 成都星环科技有限公司 | A kind of intelligent content supplying system |
| CN108717445A (en) * | 2018-05-17 | 2018-10-30 | 南京大学 | A kind of online social platform user interest recommendation method based on historical data |
| CN108897860A (en) * | 2018-06-29 | 2018-11-27 | 中国科学技术信息研究所 | Information-pushing method, device, electronic equipment and computer readable storage medium |
| CN109189892A (en) * | 2018-09-17 | 2019-01-11 | 北京点网聚科技有限公司 | A kind of recommended method and device based on article review |
| CN109213908A (en) * | 2018-08-01 | 2019-01-15 | 浙江工业大学 | A kind of academic meeting paper supplying system based on data mining |
| CN109325179A (en) * | 2018-09-17 | 2019-02-12 | 青岛海信网络科技股份有限公司 | Method and device for promoting content |
| CN109344319A (en) * | 2018-11-01 | 2019-02-15 | 中国搜索信息科技股份有限公司 | Content temperature prediction technique on a kind of line based on integrated study |
| CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | Based on RNN, the news recommended method of attention mechanism and theme characterizing method |
| CN109672706A (en) * | 2017-10-16 | 2019-04-23 | 百度在线网络技术(北京)有限公司 | A kind of information recommendation method, device, server and storage medium |
| CN109801146A (en) * | 2019-01-18 | 2019-05-24 | 北京工业大学 | A kind of resource service recommended method and system based on Demand perference |
| CN110008334A (en) * | 2017-08-04 | 2019-07-12 | 腾讯科技(北京)有限公司 | A kind of information processing method, device and storage medium |
| CN110020189A (en) * | 2018-06-29 | 2019-07-16 | 武汉掌游科技有限公司 | A kind of article recommended method based on Chinese Similarity measures |
| CN110020110A (en) * | 2017-09-15 | 2019-07-16 | 腾讯科技(北京)有限公司 | Media content recommendations method, apparatus and storage medium |
| US10387115B2 (en) | 2015-09-28 | 2019-08-20 | Yandex Europe Ag | Method and apparatus for generating a recommended set of items |
| US10387513B2 (en) | 2015-08-28 | 2019-08-20 | Yandex Europe Ag | Method and apparatus for generating a recommended content list |
| US10394420B2 (en) | 2016-05-12 | 2019-08-27 | Yandex Europe Ag | Computer-implemented method of generating a content recommendation interface |
| CN110209822A (en) * | 2019-06-11 | 2019-09-06 | 中译语通科技股份有限公司 | Sphere of learning data dependence prediction technique based on deep learning, computer |
| CN110245080A (en) * | 2019-05-28 | 2019-09-17 | 厦门美柚信息科技有限公司 | Generate the method and device of scrnario testing use-case |
| CN110297882A (en) * | 2019-03-01 | 2019-10-01 | 阿里巴巴集团控股有限公司 | Training corpus determines method and device |
| US10430481B2 (en) | 2016-07-07 | 2019-10-01 | Yandex Europe Ag | Method and apparatus for generating a content recommendation in a recommendation system |
| CN110309411A (en) * | 2018-03-15 | 2019-10-08 | 中国移动通信集团有限公司 | A resource recommendation method and device |
| WO2019192352A1 (en) * | 2018-04-03 | 2019-10-10 | 阿里巴巴集团控股有限公司 | Video-based interactive discussion method and apparatus, and terminal device |
| US10452731B2 (en) | 2015-09-28 | 2019-10-22 | Yandex Europe Ag | Method and apparatus for generating a recommended set of items for a user |
| CN110490547A (en) * | 2019-08-13 | 2019-11-22 | 北京航空航天大学 | Office system intellectualized technology |
| CN110598151A (en) * | 2019-09-09 | 2019-12-20 | 河南牧业经济学院 | A method and system for judging the effect of news dissemination |
| CN110688476A (en) * | 2019-09-23 | 2020-01-14 | 腾讯科技(北京)有限公司 | Text recommendation method and device based on artificial intelligence |
| CN110866181A (en) * | 2019-10-12 | 2020-03-06 | 平安国际智慧城市科技股份有限公司 | Resource recommendation method, device and storage medium |
| CN110866106A (en) * | 2019-10-10 | 2020-03-06 | 重庆金融资产交易所有限责任公司 | Text recommendation method and related equipment |
| USD882600S1 (en) | 2017-01-13 | 2020-04-28 | Yandex Europe Ag | Display screen with graphical user interface |
| CN111177372A (en) * | 2019-12-06 | 2020-05-19 | 绍兴市上虞区理工高等研究院 | Scientific and technological achievement classification method, device, equipment and medium |
| US10674215B2 (en) | 2018-09-14 | 2020-06-02 | Yandex Europe Ag | Method and system for determining a relevancy parameter for content item |
| CN111241403A (en) * | 2020-01-15 | 2020-06-05 | 华南师范大学 | Deep learning-based team recommendation method, system and storage medium |
| CN111241318A (en) * | 2020-01-03 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Method, device, equipment and storage medium for selecting object to push cover picture |
| CN111325006A (en) * | 2020-03-17 | 2020-06-23 | 北京百度网讯科技有限公司 | Information interaction method and device, electronic equipment and storage medium |
| US10706325B2 (en) | 2016-07-07 | 2020-07-07 | Yandex Europe Ag | Method and apparatus for selecting a network resource as a source of content for a recommendation system |
| CN111563177A (en) * | 2020-05-15 | 2020-08-21 | 深圳掌酷软件有限公司 | Theme wallpaper recommendation method and system based on cosine algorithm |
| CN111625439A (en) * | 2020-06-01 | 2020-09-04 | 杭州弧途科技有限公司 | Method for analyzing viscosity of app user based on log data of user behavior |
| CN111651675A (en) * | 2020-06-09 | 2020-09-11 | 杨鹏 | UCL-based user interest topic mining method and device |
| CN112052330A (en) * | 2019-06-05 | 2020-12-08 | 上海游昆信息技术有限公司 | Application keyword distribution method and device |
| CN112287199A (en) * | 2020-10-29 | 2021-01-29 | 黑龙江稻榛通网络技术服务有限公司 | A big data center processing system based on cloud server |
| CN112559901A (en) * | 2020-12-11 | 2021-03-26 | 百度在线网络技术(北京)有限公司 | Resource recommendation method and device, electronic equipment, storage medium and computer program product |
| CN112667899A (en) * | 2020-12-30 | 2021-04-16 | 杭州智聪网络科技有限公司 | Cold start recommendation method and device based on user interest migration and storage equipment |
| US11086888B2 (en) | 2018-10-09 | 2021-08-10 | Yandex Europe Ag | Method and system for generating digital content recommendation |
| CN113268683A (en) * | 2021-04-15 | 2021-08-17 | 南京邮电大学 | Academic literature recommendation method based on multiple dimensions |
| CN113360776A (en) * | 2021-07-19 | 2021-09-07 | 西南大学 | Scientific and technological resource recommendation method based on cross-table data mining |
| CN113420058A (en) * | 2021-07-01 | 2021-09-21 | 宁波大学 | Conversational academic conference recommendation method based on combination of user historical behaviors |
| CN113536085A (en) * | 2021-06-23 | 2021-10-22 | 西华大学 | Topic word search crawler scheduling method and system based on combined prediction method |
| CN113568882A (en) * | 2021-08-03 | 2021-10-29 | 重庆仓舟网络科技有限公司 | OSS-based resource sharing method and system |
| CN113779954A (en) * | 2021-01-29 | 2021-12-10 | 北京京东拓先科技有限公司 | Similar text recommendation method and device and electronic equipment |
| CN113921016A (en) * | 2021-10-15 | 2022-01-11 | 阿波罗智联(北京)科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
| US11263217B2 (en) | 2018-09-14 | 2022-03-01 | Yandex Europe Ag | Method of and system for determining user-specific proportions of content for recommendation |
| US11276079B2 (en) | 2019-09-09 | 2022-03-15 | Yandex Europe Ag | Method and system for meeting service level of content item promotion |
| US11276076B2 (en) | 2018-09-14 | 2022-03-15 | Yandex Europe Ag | Method and system for generating a digital content recommendation |
| US11288333B2 (en) | 2018-10-08 | 2022-03-29 | Yandex Europe Ag | Method and system for estimating user-item interaction data based on stored interaction data by using multiple models |
| CN114254103A (en) * | 2021-11-26 | 2022-03-29 | 广东电力信息科技有限公司 | Conference summary generation method based on theme generation model |
| CN114492389A (en) * | 2020-11-12 | 2022-05-13 | 中移动信息技术有限公司 | Corpus type determining method, apparatus, device and storage medium |
| CN114519097A (en) * | 2022-04-21 | 2022-05-20 | 宁波大学 | Academic paper recommendation method for heterogeneous information network enhancement |
| CN117575745A (en) * | 2024-01-17 | 2024-02-20 | 山东正禾大教育科技有限公司 | Course teaching resource individual recommendation method based on AI big data |
| CN118522430A (en) * | 2024-07-25 | 2024-08-20 | 上海交通大学医学院附属仁济医院 | Science popularization information initiative matching and pushing system based on Internet hospital |
| CN118820389A (en) * | 2024-09-18 | 2024-10-22 | 戎行技术有限公司 | Keyword-based data association storage method and device |
| CN119226603A (en) * | 2024-08-23 | 2024-12-31 | 山东省大数据中心 | A vertical search engine ranking method and system based on multi-dimensional features |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103324761A (en) * | 2013-07-11 | 2013-09-25 | 广州市尊网商通资讯科技有限公司 | Product database forming method based on Internet data and system |
| CN104680453A (en) * | 2015-02-28 | 2015-06-03 | 北京大学 | Course recommendation method and system based on students' attributes |
| CN103336793B (en) * | 2013-06-09 | 2015-08-12 | 中国科学院计算技术研究所 | A kind of personalized article recommends method and system thereof |
2016
- 2016-12-09: CN application CN201611130297.9A filed; patent CN106815297B granted (status: Active)
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103336793B (en) * | 2013-06-09 | 2015-08-12 | 中国科学院计算技术研究所 | A kind of personalized article recommends method and system thereof |
| CN103324761A (en) * | 2013-07-11 | 2013-09-25 | 广州市尊网商通资讯科技有限公司 | Product database forming method based on Internet data and system |
| CN104680453A (en) * | 2015-02-28 | 2015-06-03 | 北京大学 | Course recommendation method and system based on students' attributes |
Non-Patent Citations (1)
| Title |
|---|
| 高洁 (Gao Jie): "Research and Implementation of a High-Quality Academic Resource Recommendation Method", China Master's Theses Full-text Database, Information Science and Technology series * |
Cited By (103)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10387513B2 (en) | 2015-08-28 | 2019-08-20 | Yandex Europe Ag | Method and apparatus for generating a recommended content list |
| US10387115B2 (en) | 2015-09-28 | 2019-08-20 | Yandex Europe Ag | Method and apparatus for generating a recommended set of items |
| US10452731B2 (en) | 2015-09-28 | 2019-10-22 | Yandex Europe Ag | Method and apparatus for generating a recommended set of items for a user |
| US10394420B2 (en) | 2016-05-12 | 2019-08-27 | Yandex Europe Ag | Computer-implemented method of generating a content recommendation interface |
| US10706325B2 (en) | 2016-07-07 | 2020-07-07 | Yandex Europe Ag | Method and apparatus for selecting a network resource as a source of content for a recommendation system |
| US10430481B2 (en) | 2016-07-07 | 2019-10-01 | Yandex Europe Ag | Method and apparatus for generating a content recommendation in a recommendation system |
| USD892847S1 (en) | 2017-01-13 | 2020-08-11 | Yandex Europe Ag | Display screen with graphical user interface |
| USD890802S1 (en) | 2017-01-13 | 2020-07-21 | Yandex Europe Ag | Display screen with graphical user interface |
| USD980246S1 (en) | 2017-01-13 | 2023-03-07 | Yandex Europe Ag | Display screen with graphical user interface |
| USD892846S1 (en) | 2017-01-13 | 2020-08-11 | Yandex Europe Ag | Display screen with graphical user interface |
| USD882600S1 (en) | 2017-01-13 | 2020-04-28 | Yandex Europe Ag | Display screen with graphical user interface |
| CN107247751A (en) * | 2017-05-26 | 2017-10-13 | 武汉大学 | Content recommendation method based on LDA topic models |
| CN107247751B (en) * | 2017-05-26 | 2020-01-14 | 武汉大学 | LDA topic model-based content recommendation method |
| CN108280114A (en) * | 2017-07-28 | 2018-07-13 | 淮阴工学院 | A kind of user's literature reading interest analysis method based on deep learning |
| CN108280114B (en) * | 2017-07-28 | 2022-01-28 | 淮阴工学院 | Deep learning-based user literature reading interest analysis method |
| CN110008334A (en) * | 2017-08-04 | 2019-07-12 | 腾讯科技(北京)有限公司 | A kind of information processing method, device and storage medium |
| CN107590232B (en) * | 2017-09-07 | 2019-12-06 | 北京师范大学 | A resource recommendation system and method based on network learning environment |
| CN107590232A (en) * | 2017-09-07 | 2018-01-16 | 北京师范大学 | A kind of resource recommendation system and method based on Network Study Environment |
| CN110020110A (en) * | 2017-09-15 | 2019-07-16 | 腾讯科技(北京)有限公司 | Media content recommendations method, apparatus and storage medium |
| CN110020110B (en) * | 2017-09-15 | 2023-04-07 | 腾讯科技(北京)有限公司 | Media content recommendation method, device and storage medium |
| CN109672706B (en) * | 2017-10-16 | 2022-06-14 | 百度在线网络技术(北京)有限公司 | Information recommendation method and device, server and storage medium |
| CN109672706A (en) * | 2017-10-16 | 2019-04-23 | 百度在线网络技术(北京)有限公司 | A kind of information recommendation method, device, server and storage medium |
| CN107908669A (en) * | 2017-10-17 | 2018-04-13 | 广东广业开元科技有限公司 | A kind of big data news based on parallel LDA recommends method, system and device |
| CN107818145A (en) * | 2017-10-18 | 2018-03-20 | 南京邮数通信息科技有限公司 | A kind of user behavior tag along sort extracting method based on dynamic reptile |
| CN107833061A (en) * | 2017-11-17 | 2018-03-23 | 中农网购(江苏)电子商务有限公司 | One kind is for retail Intelligent agricultural product allocator |
| CN108090131A (en) * | 2017-11-23 | 2018-05-29 | 北京洪泰同创信息技术有限公司 | It teaches the method for pushing of auxiliary resource data and teaches the pusher of auxiliary resource data |
| CN108038765A (en) * | 2017-12-23 | 2018-05-15 | 临泉县烜槿餐饮管理有限公司 | A kind of Catering Management formula order dishes system caught based on video |
| CN108255992A (en) * | 2017-12-29 | 2018-07-06 | 广州贝睿信息科技有限公司 | It is a kind of paint originally can be readability assessment recommend method |
| CN108446273A (en) * | 2018-03-15 | 2018-08-24 | 哈工大机器人(合肥)国际创新研究院 | Kalman filtering term vector learning method based on Di's formula process |
| CN110309411A (en) * | 2018-03-15 | 2019-10-08 | 中国移动通信集团有限公司 | A resource recommendation method and device |
| CN108446273B (en) * | 2018-03-15 | 2021-07-20 | 哈工大机器人(合肥)国际创新研究院 | Kalman filtering word vector learning method based on Dield process |
| CN108600306A (en) * | 2018-03-20 | 2018-09-28 | 成都星环科技有限公司 | A kind of intelligent content supplying system |
| WO2019192352A1 (en) * | 2018-04-03 | 2019-10-10 | 阿里巴巴集团控股有限公司 | Video-based interactive discussion method and apparatus, and terminal device |
| CN108595593B (en) * | 2018-04-19 | 2021-11-23 | 南京大学 | Topic model-based conference research hotspot and development trend information analysis method |
| CN108595593A (en) * | 2018-04-19 | 2018-09-28 | 南京大学 | Meeting research hotspot based on topic model and development trend information analysis method |
| CN108717445A (en) * | 2018-05-17 | 2018-10-30 | 南京大学 | A kind of online social platform user interest recommendation method based on historical data |
| CN108897860B (en) * | 2018-06-29 | 2022-05-27 | 中国科学技术信息研究所 | Information pushing method and device, electronic equipment and computer readable storage medium |
| CN108897860A (en) * | 2018-06-29 | 2018-11-27 | 中国科学技术信息研究所 | Information-pushing method, device, electronic equipment and computer readable storage medium |
| CN110020189A (en) * | 2018-06-29 | 2019-07-16 | 武汉掌游科技有限公司 | A kind of article recommended method based on Chinese Similarity measures |
| CN109213908A (en) * | 2018-08-01 | 2019-01-15 | 浙江工业大学 | A kind of academic meeting paper supplying system based on data mining |
| US11263217B2 (en) | 2018-09-14 | 2022-03-01 | Yandex Europe Ag | Method of and system for determining user-specific proportions of content for recommendation |
| US11276076B2 (en) | 2018-09-14 | 2022-03-15 | Yandex Europe Ag | Method and system for generating a digital content recommendation |
| US10674215B2 (en) | 2018-09-14 | 2020-06-02 | Yandex Europe Ag | Method and system for determining a relevancy parameter for content item |
| CN109325179B (en) * | 2018-09-17 | 2020-12-04 | 青岛海信网络科技股份有限公司 | Method and device for promoting content |
| CN109189892A (en) * | 2018-09-17 | 2019-01-11 | 北京点网聚科技有限公司 | A kind of recommended method and device based on article review |
| CN109325179A (en) * | 2018-09-17 | 2019-02-12 | 青岛海信网络科技股份有限公司 | Method and device for promoting content |
| US11288333B2 (en) | 2018-10-08 | 2022-03-29 | Yandex Europe Ag | Method and system for estimating user-item interaction data based on stored interaction data by using multiple models |
| US11086888B2 (en) | 2018-10-09 | 2021-08-10 | Yandex Europe Ag | Method and system for generating digital content recommendation |
| CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | Based on RNN, the news recommended method of attention mechanism and theme characterizing method |
| CN109492157B (en) * | 2018-10-24 | 2021-08-31 | 华侨大学 | News recommendation method and topic representation method based on RNN and attention mechanism |
| CN109344319A (en) * | 2018-11-01 | 2019-02-15 | 中国搜索信息科技股份有限公司 | Content temperature prediction technique on a kind of line based on integrated study |
| CN109344319B (en) * | 2018-11-01 | 2021-08-24 | 中国搜索信息科技股份有限公司 | Online content popularity prediction method based on ensemble learning |
| CN109801146A (en) * | 2019-01-18 | 2019-05-24 | 北京工业大学 | A kind of resource service recommended method and system based on Demand perference |
| CN109801146B (en) * | 2019-01-18 | 2020-12-29 | 北京工业大学 | A method and system for resource service recommendation based on demand preference |
| CN110297882A (en) * | 2019-03-01 | 2019-10-01 | 阿里巴巴集团控股有限公司 | Training corpus determines method and device |
| CN110245080A (en) * | 2019-05-28 | 2019-09-17 | 厦门美柚信息科技有限公司 | Generate the method and device of scrnario testing use-case |
| CN110245080B (en) * | 2019-05-28 | 2022-08-16 | 厦门美柚股份有限公司 | Method and device for generating scene test case |
| CN112052330A (en) * | 2019-06-05 | 2020-12-08 | 上海游昆信息技术有限公司 | Application keyword distribution method and device |
| CN112052330B (en) * | 2019-06-05 | 2021-11-26 | 上海游昆信息技术有限公司 | Application keyword distribution method and device |
| CN110209822B (en) * | 2019-06-11 | 2021-12-21 | 中译语通科技股份有限公司 | Academic field data correlation prediction method based on deep learning and computer |
| CN110209822A (en) * | 2019-06-11 | 2019-09-06 | 中译语通科技股份有限公司 | Sphere of learning data dependence prediction technique based on deep learning, computer |
| CN110490547A (en) * | 2019-08-13 | 2019-11-22 | 北京航空航天大学 | Office system intellectualized technology |
| US11276079B2 (en) | 2019-09-09 | 2022-03-15 | Yandex Europe Ag | Method and system for meeting service level of content item promotion |
| CN110598151A (en) * | 2019-09-09 | 2019-12-20 | 河南牧业经济学院 | A method and system for judging the effect of news dissemination |
| CN110688476A (en) * | 2019-09-23 | 2020-01-14 | 腾讯科技(北京)有限公司 | Text recommendation method and device based on artificial intelligence |
| CN110866106A (en) * | 2019-10-10 | 2020-03-06 | 重庆金融资产交易所有限责任公司 | Text recommendation method and related equipment |
| CN110866181A (en) * | 2019-10-12 | 2020-03-06 | 平安国际智慧城市科技股份有限公司 | Resource recommendation method, device and storage medium |
| CN110866181B (en) * | 2019-10-12 | 2022-04-22 | 平安国际智慧城市科技股份有限公司 | Resource recommendation method, device and storage medium |
| CN111177372A (en) * | 2019-12-06 | 2020-05-19 | 绍兴市上虞区理工高等研究院 | Scientific and technological achievement classification method, device, equipment and medium |
| CN111241318B (en) * | 2020-01-03 | 2021-04-13 | 北京字节跳动网络技术有限公司 | Method, device, equipment and storage medium for selecting object to push cover picture |
| CN111241318A (en) * | 2020-01-03 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Method, device, equipment and storage medium for selecting object to push cover picture |
| CN111241403B (en) * | 2020-01-15 | 2023-04-18 | 华南师范大学 | Deep learning-based team recommendation method, system and storage medium |
| CN111241403A (en) * | 2020-01-15 | 2020-06-05 | 华南师范大学 | Deep learning-based team recommendation method, system and storage medium |
| CN111325006A (en) * | 2020-03-17 | 2020-06-23 | 北京百度网讯科技有限公司 | Information interaction method and device, electronic equipment and storage medium |
| CN111325006B (en) * | 2020-03-17 | 2023-05-05 | 北京百度网讯科技有限公司 | An information interaction method, device, electronic device and storage medium |
| CN111563177B (en) * | 2020-05-15 | 2023-05-23 | 深圳掌酷软件有限公司 | A method and system for recommending theme wallpapers based on cosine algorithm |
| CN111563177A (en) * | 2020-05-15 | 2020-08-21 | 深圳掌酷软件有限公司 | Theme wallpaper recommendation method and system based on cosine algorithm |
| CN111625439B (en) * | 2020-06-01 | 2023-07-04 | 杭州弧途科技有限公司 | Method for analyzing app user viscosity based on log data of user behaviors |
| CN111625439A (en) * | 2020-06-01 | 2020-09-04 | 杭州弧途科技有限公司 | Method for analyzing viscosity of app user based on log data of user behavior |
| CN111651675A (en) * | 2020-06-09 | 2020-09-11 | 杨鹏 | UCL-based user interest topic mining method and device |
| CN111651675B (en) * | 2020-06-09 | 2023-07-04 | 杨鹏 | UCL-based user interest topic mining method and device |
| CN112287199A (en) * | 2020-10-29 | 2021-01-29 | 黑龙江稻榛通网络技术服务有限公司 | A big data center processing system based on cloud server |
| CN114492389A (en) * | 2020-11-12 | 2022-05-13 | 中移动信息技术有限公司 | Corpus type determining method, apparatus, device and storage medium |
| CN112559901A (en) * | 2020-12-11 | 2021-03-26 | 百度在线网络技术(北京)有限公司 | Resource recommendation method and device, electronic equipment, storage medium and computer program product |
| CN112667899A (en) * | 2020-12-30 | 2021-04-16 | 杭州智聪网络科技有限公司 | Cold start recommendation method and device based on user interest migration and storage equipment |
| CN113779954A (en) * | 2021-01-29 | 2021-12-10 | 北京京东拓先科技有限公司 | Similar text recommendation method and device and electronic equipment |
| CN113268683B (en) * | 2021-04-15 | 2023-05-16 | 南京邮电大学 | Academic literature recommendation method based on multiple dimensions |
| CN113268683A (en) * | 2021-04-15 | 2021-08-17 | 南京邮电大学 | Academic literature recommendation method based on multiple dimensions |
| CN113536085A (en) * | 2021-06-23 | 2021-10-22 | 西华大学 | Topic word search crawler scheduling method and system based on combined prediction method |
| CN113420058A (en) * | 2021-07-01 | 2021-09-21 | 宁波大学 | Conversational academic conference recommendation method based on combination of user historical behaviors |
| CN113360776A (en) * | 2021-07-19 | 2021-09-07 | 西南大学 | Scientific and technological resource recommendation method based on cross-table data mining |
| CN113360776B (en) * | 2021-07-19 | 2023-07-21 | 西南大学 | Technology resource recommendation method based on cross-table data mining |
| CN113568882A (en) * | 2021-08-03 | 2021-10-29 | 重庆仓舟网络科技有限公司 | OSS-based resource sharing method and system |
| CN113921016A (en) * | 2021-10-15 | 2022-01-11 | 阿波罗智联(北京)科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
| CN114254103A (en) * | 2021-11-26 | 2022-03-29 | 广东电力信息科技有限公司 | Conference summary generation method based on theme generation model |
| CN114519097B (en) * | 2022-04-21 | 2022-07-19 | 宁波大学 | Academic paper recommendation method for heterogeneous information network enhancement |
| CN114519097A (en) * | 2022-04-21 | 2022-05-20 | 宁波大学 | Academic paper recommendation method for heterogeneous information network enhancement |
| CN117575745A (en) * | 2024-01-17 | 2024-02-20 | 山东正禾大教育科技有限公司 | Course teaching resource individual recommendation method based on AI big data |
| CN117575745B (en) * | 2024-01-17 | 2024-04-30 | 山东正禾大教育科技有限公司 | Course teaching resource individual recommendation method based on AI big data |
| CN118522430A (en) * | 2024-07-25 | 2024-08-20 | 上海交通大学医学院附属仁济医院 | Science popularization information initiative matching and pushing system based on Internet hospital |
| CN119226603A (en) * | 2024-08-23 | 2024-12-31 | 山东省大数据中心 | A vertical search engine ranking method and system based on multi-dimensional features |
| CN118820389A (en) * | 2024-09-18 | 2024-10-22 | 戎行技术有限公司 | Keyword-based data association storage method and device |
| CN118820389B (en) * | 2024-09-18 | 2024-12-17 | 戎行技术有限公司 | Keyword-based data association storage method and device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN106815297B (en) | 2020-04-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106815297B (en) | | Academic resource recommendation service system and method |
| US7844592B2 (en) | | Ontology-content-based filtering method for personalized newspapers |
| White et al. | | Predicting user interests from contextual information |
| US8650172B2 (en) | | Searchable web site discovery and recommendation |
| Elmeleegy et al. | | Mashup advisor: A recommendation tool for mashup development |
| CN101520785B (en) | | Information retrieval method and system therefor |
| CN103425799B (en) | | Topic-based personalized research direction recommendation system and method |
| CN102929928B (en) | | Multidimensional-similarity-based personalized news recommendation method |
| Li et al. | | Community detection using hierarchical clustering based on edge-weighted similarity in cloud environment |
| US7519588B2 (en) | | Keyword characterization and application |
| CN104484431B (en) | | A multi-source personalized news webpage recommendation method based on domain ontology |
| CN106682152B (en) | | A personalized message recommendation method |
| CN112989215B (en) | | A knowledge-graph-enhanced recommendation system based on sparse user behavior data |
| WO2001025947A1 (en) | | Method of dynamically recommending web sites and answering user queries based upon affinity groups |
| CN106547864B (en) | | A personalized information retrieval method based on query expansion |
| CN106204156A (en) | | An advertisement placement method and device for network forums |
| Godoy et al. | | Interface agents personalizing Web-based tasks |
| KR100954842B1 (en) | | Web page classification method using category tag information, and system and recording medium recording the same |
| CN105677838A (en) | | User profile creation and personalized search ranking method and system based on user requirements |
| CN116431895A (en) | | Personalized recommendation method and system for safety production knowledge |
| CN103823847A (en) | | Keyword extension method and device |
| Maidel et al. | | Ontological content‐based filtering for personalised newspapers: A method and its evaluation |
| Ma et al. | | Book recommendation model based on wide and deep model |
| CN114528469A (en) | | Recommendation method and device, electronic equipment and storage medium |
| Ahamed et al. | | Deduce user search progression with feedback session |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |